The asymptotic pseudo-trajectory approach to stochastic approximation of
Benaïm, Hofbauer and Sorin is extended to asynchronous stochastic
approximations with a set-valued mean field. The asynchronicity of the
process is incorporated into the mean field to produce convergence results
which remain similar to those of an equivalent synchronous process. In
addition, this allows many of the restrictive assumptions previously
associated with asynchronous stochastic approximation to be removed. The
framework is extended to a coupled asynchronous stochastic approximation
process with set-valued mean fields. Two-timescales arguments are used here
in a similar manner to the original work in this area by Borkar. The
applicability of this approach is demonstrated through learning in a Markov
decision process.



1. Introduction. Many learning algorithms include a stochastic updating
schedule, often based on a Markov chain. The performance of these processes
can be studied using the asynchronous stochastic approximation framework.
However, the previous work in this area has focused on continuous,
single-valued updates (see for example [8], [13], [16], [17], [23]).
Furthermore, some of the assumptions which are typically used are
challenging to verify. In this work we extend the asymptotic
pseudo-trajectory approach of Benaïm, Hofbauer and Sorin [4] to asynchronous
stochastic approximations with set-valued mean fields. We incorporate the
asynchronicity into the mean field to give a differential inclusion which
characterises the limiting behaviour of the associated learning process.

Consider an iterative process $\{x_n\}_{n \in \mathbb{N}}$ where $x_n \in \mathbb{R}^K$ and denote the
$i$-th component of $x_n$ as $x_n(i)$, where $i \in I$ and $K = |I|$ is finite. A typical
stochastic approximation (SA) is of the form

(1.1)  $x_{n+1} = x_n + \alpha(n+1) \left[ F(x_n) + V_{n+1} + d_{n+1} \right],$

where $\{\alpha(n)\}_{n \in \mathbb{N}}$ is a positive, decreasing sequence, $\{V_n\}_{n \in \mathbb{N}}$ is a zero-mean
martingale noise sequence, $\{d_n\}_{n \in \mathbb{N}}$ is a bounded sequence which converges
to zero and $F(\cdot) : \mathbb{R}^K \to \mathbb{R}^K$ is a Lipschitz continuous mean field. Standard
arguments (e.g. [3]) are then used to show that the limiting behaviour of the
iterative process in (1.1) can be studied through the ordinary differential
equation (ODE)

(1.2)  $\frac{dx}{dt} = F(x).$

Commonly known as the ODE method of stochastic approximation, originally
proposed by Ljung [19], this technique has been extended by numerous
authors, for example Benaïm [2], Benaïm, Hofbauer and Sorin [4], Borkar
[10], Kushner and Clark [15] and Kushner and Yin [16], [17]. In particular
Benaïm, Hofbauer and Sorin [4] have developed the approach so that
under some weak criteria $\{x_n\}_{n \in \mathbb{N}}$ can be updated via a set-valued mean
field, $F(\cdot)$. This allows for the limiting behaviour to be studied using the
associated differential inclusion.
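
To make the ODE method concrete, the following minimal sketch iterates a scalar instance of (1.1). The choices here are illustrative assumptions and not taken from the paper: the mean field is F(x) = -x (so the ODE (1.2) has the globally attracting point 0), the step size is alpha(n) = 1/n, the noise is i.i.d. uniform on [-1, 1], and d_n = 0.

```python
import random


def run_sa(F, x0, steps, seed=0):
    """Iterate x_{n+1} = x_n + alpha(n+1) * (F(x_n) + V_{n+1}) with alpha(n) = 1/n."""
    rng = random.Random(seed)
    x = x0
    for n in range(steps):
        alpha = 1.0 / (n + 1)           # positive, decreasing step size (assumed form)
        noise = rng.uniform(-1.0, 1.0)  # zero-mean, bounded noise V_{n+1} (assumed form)
        x = x + alpha * (F(x) + noise)
    return x


# Illustrative mean field: dx/dt = -x is attracted to 0 from any starting point.
x_final = run_sa(lambda x: -x, x0=5.0, steps=20000)
print(abs(x_final))
```

With these particular choices the iterate after n steps is simply the sample mean of the first n noise terms, which is why it settles near the rest point of the ODE.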

Standard stochastic approximations are not always applicable; an example
which we examine in this paper is learning action values in a Markov
decision process (MDP), which is also discussed by Konda and Tsitsiklis
[14], Tsitsiklis [23] and Singh et al. [21]. In an MDP updates are made to a
single random component at each iteration. Therefore we have a stochastic,
asynchronous updating pattern, where a subset of an iterative process similar
to (1.1) can be updated many times before the remaining components are
selected for a single update. Based on this idea extensions to the standard
theory have been examined, such as those by Kushner and Yin [16], [17]. Here
however we follow the extension to asynchronous stochastic approximation
provided by Borkar [8] and Konda and Borkar [13]. They show that when the
iterative updates have a Lipschitz continuous mean field then, similarly to
a standard stochastic approximation, the limiting behaviour can be studied
via the associated differential equation,

(1.3)  $\frac{dx}{dt} = M(t) F(x),$

where $M(t)$ is a $K \times K$ diagonal matrix and the diagonal elements of $M(t)$
lie in the set $[0, 1]$ for all $t > 0$. This early work on asynchronous stochastic
approximations has certain restrictions which limit its usability. In
particular, many of the assumptions made in the work of Borkar [8] and Konda and
Borkar [13] are given in implicit form and are difficult to verify in specific
situations.

As with the initial results for a standard stochastic approximation, the
results of Borkar [8] are limited to the case when the mean field, $F(\cdot)$,
is a Lipschitz continuous function. The subsequent work by Benaïm,
Hofbauer and Sorin [4] on set-valued mean fields leaves the natural question of
whether similar results are possible for asynchronous stochastic
approximations when a set-valued mean field is used. In addition, the ODE in (1.3) is
non-autonomous and the scaling matrix $M(\cdot)$ is not explicitly defined. This
makes the limiting behaviour more difficult to analyse, although
some methods for verifying global convergence are outlined by Borkar [10].

Borkar [7] originally extended the stochastic approximation framework to
two-timescales. Since then Leslie and Collins [18] have extended this idea to
multiple timescales and Konda and Borkar [13] provide a first venture into
two-timescale asynchronous stochastic approximation. However, all of these
only consider stochastic approximations in which the mean field is Lipschitz
continuous.

The aim of this work is to combine and generalise the results by Borkar
[7], [8], Konda and Borkar [13] and Benaïm, Hofbauer and Sorin [4] to create
a framework for single and two-timescale asynchronous stochastic
approximations which is straightforward to use in practical applications. In this
paper we show that, under a set of verifiable assumptions, the diagonal
elements of $M(t)$ lie in the closed set $[\varepsilon, 1]$, for some $\varepsilon > 0$. The set $[\varepsilon, 1]$ can
be combined with the mean field $F(\cdot)$ to form a set-valued mean field, $\bar{F}(\cdot)$,
whose limiting behaviour can be studied via the associated differential
inclusion using the results of Benaïm, Hofbauer and Sorin [4]. A natural benefit
of using the differential inclusion framework is that $F(\cdot)$ can be set-valued
as this does not alter the analysis.

This paper is organised in the following manner: Section 2 reviews some 
previous results on stochastic approximation with differential inclusions and 
asynchronous stochastic approximations. In Section 3 we focus on the single- 
timescale asynchronous stochastic approximation. We state the main theo- 
rem before presenting the weak convergence results required for the proof. 
Section 4 examines the extension to a two-timescale asynchronous stochas- 
tic approximation process. Large parts of this section follow directly from 
the results in Section 3. In Section 5 we present an example of a learning 
algorithm for discounted reward Markov decision processes and obtain
convergence results by applying the method shown in Section 4. This illustrates
the ease with which this framework can be used. Finally, the paper concludes
with a summary of the work. Throughout this paper many of the proofs are
omitted from the main flow of text and are instead presented in an appendix.

2. Background. Throughout this paper we use two main ideas from 
the stochastic approximation literature. The first relates to the work by 
Benaim, Hofbauer and Sorin [4] on stochastic approximation with differ- 
ential inclusions, and the second concerns the asynchronous stochastic ap- 
proximation framework introduced by Borkar [8]. We take the opportunity 
to review the pertinent features of their work in this section. 

In what is to follow we use the standard concept of set multiplication: if
$A \subset \mathbb{R}^{K \times K}$ is a set of $K \times K$ matrices and $B \subset \mathbb{R}^K$, and both are closed, convex
sets, then let the multiplication of these sets be defined as,

  $A \cdot B = \{a \cdot b;\ a \in A,\ b \in B\} \subset \mathbb{R}^K.$

Note that $A \cdot B$ is also closed and convex. This definition is still used if either
or both of the sets $A$ and $B$ are single valued. We also use the same concept
when multiplying a constant by a set. That is, if $a$ is a constant then define

  $a \cdot B = \{a \cdot b;\ b \in B\} \subset \mathbb{R}^K.$

However, in this latter case we often drop the '$\cdot$' notation for convenience.

2.1. Stochastic Approximation with Differential Inclusions. We begin by 
outlining the current convergence results for stochastic approximations with 
set-valued maps proved by Benaim, Hofbauer and Sorin [4]. These results are 
heavily used in Section 3, most notably to prove our main result. Initially 
we provide a definition which outlines the class of set-valued mean fields 
we are able to use for stochastic approximation. These criteria are taken 
directly from the original work on stochastic approximations with differential 
inclusions by Benaim et al. [4]. 

Definition 2.1. Call $F(\cdot) : \mathbb{R}^K \to \mathbb{R}^K$ a stochastic approximation map
if it satisfies the following:

(i) $F(\cdot)$ is a closed set-valued map. That is,

  $\mathrm{Graph}(F) = \{(x, y);\ y \in F(x)\},$

is a closed set. Equivalently, $F(\cdot)$ is an upper semi-continuous
set-valued map.

(ii) For all $x \in \mathbb{R}^K$, $F(x)$ is a non-empty, compact, convex subset of $\mathbb{R}^K$.

(iii) There exists a $c > 0$ such that for all $x \in \mathbb{R}^K$,

  $\sup_{z \in F(x)} \|z\| \le c (1 + \|x\|).$

Take $F(\cdot) : \mathbb{R}^K \to \mathbb{R}^K$ as a stochastic approximation map; a typical
differential inclusion is in the form,

(2.1)  $\frac{dx}{dt} \in F(x),$

and a solution to (2.1) is an absolutely continuous mapping $\mathbf{x} : \mathbb{R} \to \mathbb{R}^K$
such that $\mathbf{x}(0) = x$ and for almost every $t > 0$,

  $\frac{d\mathbf{x}(t)}{dt} \in F(\mathbf{x}(t)).$

The flow induced by (2.1) is defined by,

  $\Phi_t(x) = \{\mathbf{x}(t);\ \mathbf{x}(\cdot) \text{ is a solution to (2.1) with } \mathbf{x}(0) = x\}.$

Definition 2.2 (Benaïm et al. [4]). A continuous function $x : \mathbb{R}_+ \to \mathbb{R}^K$
is an asymptotic pseudo-trajectory to $\Phi$ if

  $\lim_{t \to \infty} \sup_{s \in [0, T]} d\big( x(t + s), \Phi_s(x(t)) \big) = 0,$

for any $T > 0$ and where $d(\cdot, \cdot)$ is a distance measure on $\mathbb{R}^K$.

Many important properties of a dynamical system $\{\Phi_t(\cdot)\}_{t \ge 0}$ and the
asymptotic pseudo-trajectories of the system are discussed by Benaïm,
Hofbauer and Sorin [4]. Most important for the work here is that an asymptotic
pseudo-trajectory to (2.1) will behave in a similar manner to the solutions
of the differential inclusion and hence the limiting behaviour will be closely
related.

We conclude this section by considering a standard iterative process in the
form of (1.1) where the mean field $F(\cdot) : \mathbb{R}^K \to \mathbb{R}^K$ is a stochastic
approximation map. The following theorem states that under four assumptions
a linear interpolation of $\{x_n\}_{n \in \mathbb{N}}$ (a function defined precisely in Section
3.1) is an asymptotic pseudo-trajectory to the differential inclusion (2.1).
Hence the limiting behaviour of $\{x_n\}_{n \in \mathbb{N}}$ can be studied via this differential
inclusion.



Theorem 2.3. Assume that

(i) For all $T > 0$,

(2.2)  $\lim_{n \to \infty} \sup_k \left\{ \left\| \sum_{i=n}^{k-1} \alpha(i+1) V_{i+1} \right\| ;\ k = n+1, \ldots, m(\tau_n + T) \right\} = 0,$

where $\tau_0 = 0$, $\tau_n = \sum_{i=1}^{n} \alpha(i)$ and $m(t) = \sup\{k \ge 0;\ t \ge \tau_k\}$,
(ii) $\sup_n \|x_n\| = X < \infty$,
(iii) $F(\cdot)$ is a stochastic approximation map,
(iv) $d_n \to 0$ as $n \to \infty$ and $\sup_n \|d_n\| = d < \infty$.

Then a linear interpolation of the iterative process $\{x_n\}_{n \in \mathbb{N}}$ given by (1.1)
is an asymptotic pseudo-trajectory of the differential inclusion (2.1).

This is a slight modification to a result stated by Benaïm, Hofbauer and
Sorin [4, Proposition 1.3] to include the $\{d_n\}_{n \in \mathbb{N}}$ terms. It is trivial to verify
that this will not alter any of the asymptotic results of the original work.

2.2. Asynchronous Stochastic Approximations. Now we fully introduce
the asynchronous stochastic approximation notation used. A typical
asynchronous stochastic approximation such as those studied by Borkar [8] fits
the following framework. If $2^I$ is the power set of $I$, containing all possible
updating combinations, then let $I_n \in 2^I$ be the components of the iterative process
$\{x_n\}_{n \in \mathbb{N}}$ updated at iteration $n$. Using a counter for state $i \in I$,

  $\nu_n(i) := \sum_{k=1}^{n} 1_{\{i \in I_k\}},$

we consider processes in which no component, $i$, in the asynchronous
process needs to know the global counter, $n$, merely its own counter, $\nu_n(i)$.
Let $F_i(\cdot)$, $x_n(i)$, $V_n(i)$ and $d_n(i)$ be the $i$-th component of $F(\cdot)$, $x_n$, $V_n$ and
$d_n$ respectively, for $i = 1, \ldots, K$. We directly extend the notation used in
(1.1) for an asynchronous stochastic approximation; let $F(\cdot)$ be a stochastic
approximation map, then for $i = 1, \ldots, K$ let

(2.3)  $x_{n+1}(i) \in x_n(i) + \alpha(\nu_{n+1}(i)) 1_{\{i \in I_{n+1}\}} \left[ F_i(x_n) + V_{n+1}(i) + d_{n+1}(i) \right].$

Define the asynchronous step sizes, $\bar{\alpha}_n$, and the relative step sizes, $\mu_n(i)$,
to be

  $\bar{\alpha}_n := \max_{i \in I_n} \alpha(\nu_n(i)), \qquad \mu_n(i) := \frac{\alpha(\nu_n(i))}{\bar{\alpha}_n} 1_{\{i \in I_n\}}.$
The asynchronous step sizes, $\bar{\alpha}_n$, are random step sizes (in contrast with
the deterministic $\alpha(n)$ terms) whilst the relative step sizes, $\mu_n(i)$, are zero
whenever the $i$-th component of the iterative process is not updated. Clearly
$\mu_n(i) \in [0, 1]$. By letting $M_n$ be the $K \times K$ diagonal matrix of the $\mu_n(i)$
terms we can express the previous asynchronous stochastic approximation
(2.3) in the more concise form

(2.4)  $x_{n+1} - x_n - \bar{\alpha}_{n+1} M_{n+1} \left[ V_{n+1} + d_{n+1} \right] \in \bar{\alpha}_{n+1} M_{n+1} \cdot F(x_n).$

This is a more familiar form for a stochastic approximation with a
set-valued mean field. If $F(\cdot)$ is a stochastic approximation map in (1.1) then
(2.4) differs from (1.1) only in that the step sizes in (2.4) are random and in the
addition of the $M_{n+1}$ coefficient. Instead of thinking of $M_{n+1}$ as a coefficient
of the step sizes we combine it with the mean field. Convergence of the error
term, $d_{n+1}$, will be unaffected and, under a set of assumptions in Section
3.2, the noise term $M_{n+1} V_{n+1}$ will still satisfy the Kushner and Clark noise
condition (2.2). Combining the $M_{n+1}$ term and the mean field term $F(\cdot)$
into a single set provides an intuitive method of rephrasing the stochastic
approximation and leads to a set-valued mean field.
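
As an illustration of the asynchronous and relative step sizes defined above, the sketch below computes them for one iteration. The step size alpha(n) = 1/n, the counter values and the updated set are all hypothetical values chosen for the example; components are indexed from 0.

```python
def alpha(n):
    """Deterministic step size alpha(n) = 1/n (an illustrative choice)."""
    return 1.0 / n


def async_step_sizes(counters, updated):
    """Return (alpha_bar, mu) for one iteration.

    counters[i] plays the role of nu_n(i), the number of updates of component i
    so far (including the current one, so it is at least 1 for updated
    components); updated is the set I_n of components updated at this iteration.
    """
    alpha_bar = max(alpha(counters[i]) for i in updated)
    mu = [alpha(counters[i]) / alpha_bar if i in updated else 0.0
          for i in range(len(counters))]
    return alpha_bar, mu


counters = [3, 1, 7]   # hypothetical counters for K = 3 components
updated = {0, 2}       # hypothetical update set: components 0 and 2
alpha_bar, mu = async_step_sizes(counters, updated)
print(alpha_bar, mu)
```

For every updated component the product of the asynchronous step size and the relative step size recovers the component's own step size, which is why the component-wise form (2.3) and the matrix form (2.4) describe the same update.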

Proceeding with this intuition is not immediately straightforward since
$M_n$ is time varying and can be zero infinitely often, in which case the mean
field could be zero even when the original update term, $F(\cdot)$, is not. This
would mean the limiting behaviour of the differential inclusion in the
asynchronous stochastic approximation could be different to the synchronous
case, where ultimately we wish to say that the two behave in the same
manner in the limit. To avoid this scenario we follow Borkar [8] and consider the
weak limit of the interpolations of $\{M_n\}_{n \in \mathbb{N}}$, which will always be bounded
away from zero under some verifiable assumptions, given in Section 3.2.

3. Asynchronous SA with Differential Inclusions. We begin by 
presenting the main result of this paper which concerns the limiting be- 
haviour of the asynchronous stochastic approximation in (2.4), before out- 
lining the results required to prove this in the remainder of the section. 

3.1. Main Result. Assume that $F(\cdot)$ is a stochastic approximation map
and for all $n$ define $f_n \in F(x_n)$ by its component parts, $f_n(i)$, $i = 1, \ldots, K$,
such that

(3.1)  $f_n(i) \mu_{n+1}(i) := \frac{x_{n+1}(i) - x_n(i)}{\bar{\alpha}_{n+1}} - \mu_{n+1}(i) \left[ V_{n+1}(i) + d_{n+1}(i) \right].$

Notice that if $\mu_{n+1}(i) = 0$ then we can select any $f_n(i) \in F_i(x_n)$. Then
we can write the iterative process in (2.4) as

(3.2)  $x_{n+1} = x_n + \bar{\alpha}_{n+1} M_{n+1} \left[ f_n + V_{n+1} + d_{n+1} \right].$

For some fixed $\varepsilon > 0$, let $\bar{M}_n$ be a series of $K \times K$ diagonal matrices with
entries in the set $[\varepsilon, 1]$, for all $n$, to be defined in Section 3.3. We can again
rewrite the iterative process in (3.2) as

  $x_{n+1} = x_n + \bar{\alpha}_{n+1} \left[ \bar{M}_{n+1} f_n + (M_{n+1} - \bar{M}_{n+1}) f_n + M_{n+1} V_{n+1} + M_{n+1} d_{n+1} \right].$

Now by letting $\bar{V}_{n+1} = (M_{n+1} - \bar{M}_{n+1}) f_n + M_{n+1} V_{n+1}$ and $\bar{d}_{n+1} = M_{n+1} d_{n+1}$
we get,

(3.3)  $x_{n+1} = x_n + \bar{\alpha}_{n+1} \left[ \bar{M}_{n+1} f_n + \bar{V}_{n+1} + \bar{d}_{n+1} \right].$

For general $k$, $\delta > 0$ let

(3.4)  $\Omega_k^{\delta} := \{\mathrm{diag}(\omega_1, \ldots, \omega_k);\ \omega_i \in [\delta, 1],\ \forall i = 1, \ldots, k\},$

and define

(3.5)  $\bar{F}(x) := \Omega_K^{\varepsilon} \cdot F(x).$



If $F(\cdot)$ is Lipschitz continuous, direct comparisons can be made between the
mean field, $\bar{F}(x)$, and the analogous mean field $M(t) F(x)$ from (1.3) which is
used by Borkar [8] and Konda and Borkar [13]. This provides the key insight
into the new approach we take. Under the assumptions used in Section 3.2
the equivalent $M(t)$ values almost surely lie in $\Omega_K^{\varepsilon}$. By combining this with
$F(x)$ we produce a differential inclusion which is more straightforward to
study than a non-autonomous differential equation and naturally fits the
stochastic approximation framework of Benaïm, Hofbauer and Sorin [4]. In
addition, this idea naturally lends itself to examining a similar process for a
set-valued mean field as we proceed to do in this paper.
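
For a single-valued mean field the set in (3.5) is easy to picture: scaling each component of F(x) by a factor in [epsilon, 1] gives a box whose i-th side is the interval between epsilon times F_i(x) and F_i(x). The sketch below tests membership of that box; the function name, the tolerance and the sample values are illustrative assumptions.

```python
def in_scaled_field(y, fx, eps, tol=1e-12):
    """Check whether y lies in {diag(w) fx : w_i in [eps, 1]} for a single point fx = F(x)."""
    for yi, fi in zip(y, fx):
        lo, hi = sorted((eps * fi, fi))   # handles negative components of fx
        if not (lo - tol <= yi <= hi + tol):
            return False
    return True


fx = [2.0, -4.0, 0.0]   # hypothetical value of F(x)
print(in_scaled_field([1.0, -1.0, 0.0], fx, eps=0.25))   # True
print(in_scaled_field([0.1, -2.0, 0.0], fx, eps=0.25))   # False
```

Because epsilon is strictly positive, every element of the box has the same sign pattern as F(x), and the box contains the zero vector only when F(x) is itself zero. This is the sense in which the asynchronous mean field keeps the rest points of the synchronous one.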

Equation (3.3) can be expressed in the form of a stochastic approximation
with a set-valued mean field as in [4]:

(3.6)  $x_{n+1} - x_n - \bar{\alpha}_{n+1} \bar{V}_{n+1} - \bar{\alpha}_{n+1} \bar{d}_{n+1} \in \bar{\alpha}_{n+1} \bar{F}(x_n).$

Let $\bar{\tau}_0 := 0$, $\bar{\tau}_n := \sum_{k=1}^{n} \bar{\alpha}_k$ be the timescale for the asynchronous updates.
To allow this process to be analysed in continuous time, consider the
interpolated version of the stochastic approximation (3.6),

(3.7)  $x(\bar{\tau}_n + s) = x_n + s \frac{x_{n+1} - x_n}{\bar{\alpha}_{n+1}}, \qquad s \in [0, \bar{\alpha}_{n+1}).$

Under the assumptions (A1)-(A5), presented in Section 3.2, we show in
Section 3.3 that a sequence, $\{\bar{M}_n\}_{n \in \mathbb{N}}$, can be defined such that $\{\bar{\alpha}_n\}_{n \in \mathbb{N}}$
and $\{\bar{V}_n\}_{n \in \mathbb{N}}$ satisfy the Kushner and Clark noise condition in (2.2). By
invoking Theorem 2.3 we obtain our main result, which is proved in Section
3.4.
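
The interpolation (3.7) is piecewise linear on the asynchronous timescale. A minimal sketch for scalar iterates follows; the list layout (iterates from index 0, step sizes from index 1) and the sample numbers are assumptions made for the example.

```python
from bisect import bisect_right
from itertools import accumulate


def interpolate(xs, alpha_bars, t):
    """Evaluate the piecewise-linear interpolation x(t) of scalar iterates.

    xs = [x_0, ..., x_N]; alpha_bars = [alpha_bar_1, ..., alpha_bar_N], so the
    list entry alpha_bars[n] is the step size used to move from x_n to x_{n+1}.
    """
    taus = [0.0] + list(accumulate(alpha_bars))   # tau_bar_0, ..., tau_bar_N
    if not 0.0 <= t <= taus[-1]:
        raise ValueError("t outside the interpolated range")
    # Largest n with tau_bar_n <= t, capped so that t = tau_bar_N uses the last segment.
    n = min(bisect_right(taus, t) - 1, len(alpha_bars) - 1)
    s = t - taus[n]
    return xs[n] + s * (xs[n + 1] - xs[n]) / alpha_bars[n]


xs = [0.0, 1.0, 3.0]        # hypothetical iterates x_0, x_1, x_2
alpha_bars = [0.5, 0.25]    # hypothetical asynchronous step sizes
print(interpolate(xs, alpha_bars, 0.25), interpolate(xs, alpha_bars, 0.625))
```

The interpolated path passes through x_n at time tau_bar_n, so long stretches of small steps are compressed in continuous time, which is what allows comparison with solutions of the differential inclusion.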

Theorem 3.1. Under the assumptions (A1)-(A5), with probability 1,
$x(t)$ is an asymptotic pseudo-trajectory to the differential inclusion,

(3.8)  $\frac{dx}{dt} \in \bar{F}(x).$

Directly from Theorem 3.1 and [4, Proposition 3.27] we get the key result 
concerning the convergence of an asynchronous stochastic approximation 
process. 

Corollary 3.2. If there is a globally attracting set, $A$, for the
differential inclusion (3.8), and assumptions (A1)-(A5) are satisfied, then the
iterative process (2.4) will converge to $A$.

3.2. Assumptions. Throughout this section we study the convergence 
properties of the iterative process (2.4). We make reference to the follow- 
ing assumptions, (A1)-(A5), all of which are either standard requirements 
for a stochastic approximation or can be verified prior to running the asyn- 
chronous stochastic approximation process. This is in contrast with the pre- 
vious work on asynchronous stochastic approximations by Borkar [8] and 
Konda and Borkar [13]. 

(A1) (a) For a compact set, $C \subset \mathbb{R}^K$, $x_n \in C$ for all $n$.

(b) $\{d_n\}_{n \in \mathbb{N}}$ is a bounded sequence such that $d_n \to 0$ as $n \to \infty$.

(A2) Let $\alpha(n)$ satisfy the following criteria,

(a) $\sum_n \alpha(n) = \infty$ and $\alpha(n) \to 0$ as $n \to \infty$,

(b) For $x \in (0, 1)$, $\sup_n \alpha([xn]) / \alpha(n) < A_x < \infty$, where $[\cdot]$ means
the "integer part of". In addition, for all $n$, $\alpha(n) \ge \alpha(n + 1)$.

(A3) $F(\cdot)$ is a stochastic approximation map.

(A1)(a) is a slight strengthening of the standard stochastic approximation
boundedness assumption; however this is still a relatively mild condition.
Methods to ensure that it is satisfied are discussed elsewhere, for
example [12], [13] or [23]. A basic restriction is placed on $\{d_n\}_{n \in \mathbb{N}}$ in (A1)(b);
in this form the sequence does not affect the asymptotic behaviour of the
process. (A2)(a) is a standard assumption required for stochastic
approximation, and (A2)(b) is a mild technical condition required to deal with the
asynchronicity, which is also used by Borkar [8]. We have dropped the
additional restriction on the step-sizes used by Borkar which severely restricts
the possible choices of $\{\alpha(n)\}_{n \in \mathbb{N}}$. (A3) ensures that we can use the
convergence results presented in Theorem 2.3 and is a standard assumption for
stochastic approximations with a set-valued mean field.
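
Condition (A2)(b) can be explored numerically before running a process. The sketch below evaluates the ratio over a finite range of n; this gives evidence rather than a proof, and restricting to n with integer part at least 1 is a convention adopted here so that the step size is defined. The two step-size sequences are illustrative choices.

```python
import math


def ratio_sup(alpha, x, n_max):
    """Largest value of alpha([x n]) / alpha(n) over n <= n_max with [x n] >= 1."""
    start = math.ceil(1.0 / x)   # first n for which the integer part of x * n is at least 1
    return max(alpha(math.floor(x * n)) / alpha(n) for n in range(start, n_max + 1))


# alpha(n) = 1/n: the ratio n / [n/2] stays bounded (its largest value occurs at n = 3).
print(ratio_sup(lambda n: 1.0 / n, 0.5, 10000))

# alpha(n) = 2^(-n): the ratio grows geometrically, so (A2)(b) fails
# (this sequence also violates (A2)(a), since its sum is finite).
print(ratio_sup(lambda n: 2.0 ** -n, 0.5, 40))
```

Polynomially decaying step sizes such as 1/n pass this check, whereas geometrically decaying ones do not, which matches the intuition that components updated at different rates must still have comparable step sizes.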

Define $\bar{I} \subset 2^I$ as the set of all the possible combinations which have
positive probability of occurring. As an example, if every element of $I$ gets
updated and it is known that $I_n$ is a singleton for each $n$, then $\bar{I} = I$.

Let $\mathcal{F}_n$ be a sigma algebra containing all the information up to and
including the $n$-th iteration. That is, $\mathcal{F}_n := \sigma(\{I_m\}_m, \{x_m\}_m, \{V_m(i)\}_{i,m};\ \forall m \le n,\ i = 1, \ldots, K)$.

(A4) (a) For all $x \in C$ and $\mathcal{I}_n, \mathcal{I}_{n+1} \in \bar{I}$,

  $P(I_{n+1} = \mathcal{I}_{n+1} \mid \mathcal{F}_n) = P(I_{n+1} = \mathcal{I}_{n+1} \mid I_n = \mathcal{I}_n, x_n = x).$

Let

  $P_{\mathcal{I}_n, \mathcal{I}_{n+1}}(x) := P(I_{n+1} = \mathcal{I}_{n+1} \mid I_n = \mathcal{I}_n, x_n = x).$

(b) For all $x \in C$ the transition probabilities $P_{\mathcal{I}_n, \mathcal{I}_{n+1}}(x)$ form an
aperiodic, irreducible, positive recurrent Markov chain over $\bar{I}$ and
for all $i \in I$ there exists an $\mathcal{I} \in \bar{I}$ such that $i \in \mathcal{I}$.

(c) The map $x \mapsto P_{\mathcal{I}_n, \mathcal{I}_{n+1}}(x)$ is Lipschitz continuous.

(A4)(a) assumes that the transitions between the updated elements in
$I$ are part of a controlled Markov chain. (A4)(b) is a straightforward
assumption on this controlled Markov chain which can be verified prior to
implementation, and which removes the need for some of the original
technical assumptions made by Konda and Borkar [13]. In this previous work
Konda and Borkar assume that every state is updated at a comparable rate
in the limit, which cannot directly be verified prior to running the process.
(A4)(c) is a condition which is required later to use a result from Ma et al.
[20] on the convergence of stochastic approximation with Markovian noise.
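
Since (A4)(b) is meant to be checked before implementation, the following sketch shows one way to do so for a fixed x when the family of update sets is finite. It relies on two standard facts: a finite chain is irreducible and aperiodic exactly when some power of its transition matrix is entrywise positive, and an irreducible finite chain is positive recurrent. The function names and example matrices are hypothetical.

```python
def is_primitive(P):
    """True if some power of the finite stochastic matrix P is entrywise positive.

    For a finite chain this is equivalent to being irreducible and aperiodic;
    by Wielandt's bound it suffices to inspect the power (N - 1)**2 + 1.
    """
    n = len(P)
    adj = [[P[i][j] > 0 for j in range(n)] for i in range(n)]
    reach = [row[:] for row in adj]                  # zero pattern of P^1
    for _ in range((n - 1) ** 2):                    # advance to P^((n-1)^2 + 1)
        reach = [[any(reach[i][k] and adj[k][j] for k in range(n))
                  for j in range(n)] for i in range(n)]
    return all(all(row) for row in reach)


def covers_all_components(update_sets, K):
    """True if every component 0..K-1 belongs to at least one update set."""
    return set().union(*update_sets) == set(range(K))


lazy_cycle = [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5], [0.5, 0.0, 0.5]]
flip = [[0.0, 1.0], [1.0, 0.0]]                      # periodic with period 2
print(is_primitive(lazy_cycle), is_primitive(flip))  # True False
print(covers_all_components([{0}, {1, 2}], 3))       # True
```

In the setting of (A4) the matrix depends on x, so a check of this kind would have to hold for every x in the compact set C, for instance by examining the zero pattern of the transition probabilities as a function of x.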

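(A5) (a) For some $q \ge 2$,

  $\sum_n \alpha(n)^{1 + q/2} < \infty, \quad \text{and} \quad \sup_n E(\|V_n\|^q) < \infty;$

(b) Take $V_n(i)$ independent of $V_n(j)$ for $i \ne j$ and $V_n(i)$ independent
of $I_n$ given $\mathcal{F}_{n-1}$ for all $i = 1, \ldots, K$. Let $\langle a, b \rangle = \sum_k a_k b_k$. Then
there exists a positive $\Gamma$ such that for all $\theta \in \mathbb{R}^K$,

  $\sum_n e^{-c/\alpha(n)} < \infty$

and

  $E\left[ \exp\{\langle \theta, V_{n+1} \rangle\} \mid \mathcal{F}_n \right] \le \exp\left\{ \frac{\Gamma}{2} \|\theta\|^2 \right\};$

for each $c > 0$.

We say that (A5) holds if either (A5)(a) or (A5)(b) is true.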

An assumption similar to (A5) is used by Benaïm, Hofbauer and Sorin
[4] to verify a condition for noise convergence and is similar to that used
by Kushner and Clark [15]. We use this assumption only to show the noise
term still satisfies the Kushner-Clark condition with the convergence given
in Lemma 3.3; the proof is presented via two lemmas in Appendix A.2.

Lemma 3.3. Assume that (A2)(b), (A4) and (A5) hold. Then with
probability 1, for all $T > 0$,

  $\lim_{n \to \infty} \sup_k \left\{ \left\| \sum_{i=n}^{k-1} \bar{\alpha}_{i+1} M_{i+1} V_{i+1} \right\| ;\ k = n+1, \ldots, \bar{m}(\bar{\tau}_n + T) \right\} = 0,$

where $\bar{\tau}_0 := 0$, $\bar{\tau}_n := \sum_{k=1}^{n} \bar{\alpha}_k$ and $\bar{m}(t) := \sup\{k \ge 0;\ t \ge \bar{\tau}_k\}$.

Note that if Lemma 3.3 can be verified directly without (A5) then this
assumption is redundant, and hence we only require (A1)-(A4) and Lemma
3.3 to hold. This approach is used in Section 4.

3.3. Weak Convergence of Asynchronous Updates. As discussed in the
introduction, the key issue with asynchronous stochastic approximations is
how to handle the interaction of the relative step sizes and the mean field.
It is important to be able to bound the limit of the relative step sizes,
$\mu_n(i)$, away from zero for all $i \in I$ in order to produce an asynchronous
stochastic approximation mean field, $\bar{F}(\cdot)$, which will behave similarly to the
synchronous mean field, $F(\cdot)$. However the relative step size of a state is
zero whenever that state is not updated, hence it is not immediately clear
that this is even possible. Despite this it is sufficient that for any $T > 0$
an 'average' of $\mu_n(i)$ over length $T$ in the continuous time interpolation
converges to a value which is bounded away from zero. In this section we
prove that under (A2)(b) and (A4) this is indeed the case.
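
The averaging idea can be seen in a small simulation. In the sketch below exactly one of K components is updated per iteration, chosen uniformly at random, with alpha(n) = 1/n; these choices, the burn-in length and the window length are illustrative assumptions. With singleton updates the relative step size is 1 for the updated component and 0 for the others, so the time average of the relative step size of component i over a window is the share of that window's length contributed by updates of i.

```python
import random


def window_shares(K, burn_in, T, seed=0):
    """Share of a continuous-time window of length about T spent updating each component."""
    rng = random.Random(seed)
    counters = [0] * K
    for _ in range(burn_in):              # advance the per-component counters first
        counters[rng.randrange(K)] += 1
    elapsed, weight = 0.0, [0.0] * K
    while True:
        i = rng.randrange(K)
        counters[i] += 1
        alpha_bar = 1.0 / counters[i]     # singleton update: asynchronous step is alpha(nu(i))
        if elapsed + alpha_bar > T:
            break
        elapsed += alpha_bar
        weight[i] += alpha_bar            # integral of the relative step size of i over this step
    return [w / elapsed for w in weight]


shares = window_shares(K=2, burn_in=2000, T=1.0)
print(shares)
```

Although each relative step size is zero at roughly half of the iterations, both window averages stay well away from zero, which is the behaviour the section goes on to establish under (A2)(b) and (A4).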

For $T > 0$ the space $L^2([0, T])$ is the set of measurable functions $h(\cdot) : \mathbb{R} \to \mathbb{R}$
such that,

  $\int_0^T |h(s)|^2 \, ds < \infty.$

Following the method used in [10] and [13], define $\mathcal{U}$ to be the space of
maps $u(\cdot) : \mathbb{R} \to [0, 1]$ with the coarsest topology which for all $T > 0$ leaves
continuous the map,

  $u(\cdot) \mapsto \int_0^T h(s) u(s) \, ds,$

for all $h(\cdot) \in L^2([0, T])$. Hence $\mathcal{U}$ is a space of $[0, 1]$-valued trajectories. This means
that for any map defined on $\mathcal{U}$ convergence to a limit point will be in the
weak sense, along a subsequence. That is, a sequence of maps $\{\varphi_n(\cdot)\}_{n \in \mathbb{N}}$ such
that $\varphi_n(\cdot) \in \mathcal{U}$ for all $n$ is said to possess a limit point $\varphi(\cdot) \in \mathcal{U}$ if for fixed
$T > 0$ there exists a subsequence $k(n)$, such that for any $h(\cdot) \in L^2([0, T])$,

(3.9)  $\int_0^T h(s) \varphi_{k(n)}(s) \, ds \to \int_0^T h(s) \varphi(s) \, ds, \quad \text{as } n \to \infty.$



Many authors provide a more detailed discussion on weak convergence; for 
example [6], [10, Appendix A] or [11]. 

Now we extend the relative step sizes, $\mu_n$, to continuous time; for all $i \in I$
let $u_i(t) = \mu_{n+1}(i)$ for $t \in [\bar{\tau}_n, \bar{\tau}_{n+1})$ and let $u(\cdot) = (u_1(\cdot), \ldots, u_K(\cdot))$. For
all $i \in I$ and $t \ge 0$ define $u_i^n(t) := u_i(t + \bar{\tau}_n) \in \mathcal{U}$.

Lemma 3.4. Under (A2)(b) and (A4), for all $i \in I$ and for any $T > 0$,
$\{u_i^n(\cdot)\}_{n \in \mathbb{N}}$ converges along a subsequence to a limit point $\tilde{u}_i(\cdot)$ such that for
some $\varepsilon > 0$ and any $0 < v \le T$,

(3.10)  $\int_0^v \tilde{u}_i(s) \, ds \ge v \varepsilon, \quad \text{a.s.}$

Proof. See Appendix A.3. □

Corollary 3.5. For any $T > 0$ let $\tilde{u}(\cdot)$ be a limit point of $\{u^n(\cdot)\}_{n \in \mathbb{N}}$
in $\mathcal{U}$, then under (A2)(b) and (A4) there exists an $\varepsilon > 0$ such that for all
$i \in I$ and any $t, v > 0$ such that $t + v \le T$,

  $\int_t^{t+v} \tilde{u}_i(s) \, ds \ge v \varepsilon, \quad \text{a.s.}$

Proof. See Appendix A.3. □



We now expand upon the discussions in Sections 2.2 and 3.1 on producing
a sequence of matrices $\{\bar{M}_n\}_{n \in \mathbb{N}}$. In order to use the differential inclusions
framework described in Section 2.1 we need to define a sequence of diagonal
matrices, $\{\bar{M}_n\}_{n \in \mathbb{N}}$, with diagonal entries which are always in the set $[\varepsilon, 1]$,
for some $\varepsilon > 0$, and such that the terms converge to the same limit as the
terms of $\{M_n\}_{n \in \mathbb{N}}$. Recall that $M_n$ is a diagonal matrix containing the $\mu_n(i)$
terms.

Fix $\varepsilon > 0$ taken from Lemma 3.4 and define a new function $v(\cdot) : \mathbb{R} \to \mathbb{R}^K$
such that

  $v_i(t) := \max\{u_i(t), \varepsilon\}.$

For all $t \ge 0$ let $v_i^n(t) := v_i(t + \bar{\tau}_n)$. Corollary 3.5 shows that, with respect to
the topology of $\mathcal{U}$, in the limit $\tilde{u}_i(t) \in [\varepsilon, 1]$ for almost every $t$ and similarly
$v_i(t) \in [\varepsilon, 1]$ for all $t$. From this it is clear $u_i(t)$ and $v_i(t)$ have the same limit
point in $\mathcal{U}$. That is, if $\tilde{u}(t)$ is a limit point of $\{u^n(\cdot)\}_{n \in \mathbb{N}}$ then it is also a
limit point of $\{v^n(\cdot)\}_{n \in \mathbb{N}}$. Hence for any $T > 0$ there exists a subsequence
$k(n)$ such that for $h(\cdot) \in L^2([0, T])$,

  $\int_0^T \tilde{u}_i(t) h(t) \, dt = \lim_{n \to \infty} \int_0^T u_i^{k(n)}(t) h(t) \, dt = \lim_{n \to \infty} \int_0^T v_i^{k(n)}(t) h(t) \, dt.$

However, the key interest here is in the convergence of $u_i(\cdot)$ and $v_i(\cdot)$.
Following the reasoning of Borkar [8] and Konda and Borkar [13], it does not
matter whether $u_i^n(\cdot)$ and $v_i^n(\cdot)$ converge directly or via a subsequence as
this does not affect the convergence of the continuous processes $u_i(\cdot)$ and
$v_i(\cdot)$. Hence we can say that $u_i(\bar{\tau}_n + \cdot)$, $v_i(\bar{\tau}_n + \cdot)$ converge weakly to a limit
point $\tilde{u}_i(\cdot)$, or equivalently, if $h(\cdot)$ is any bounded, continuous function then
for all $i = 1, \ldots, K$,

(3.11)  $\lim_{n \to \infty} \int_{\bar{\tau}_n}^{\bar{\tau}_n + v} [v_i(s) - u_i(s)] h(s) \, ds = 0.$

Define $\bar{M}(t)$ to be the $K \times K$ diagonal matrix of the $v_i(t)$ terms and let
$\bar{M}_n := \bar{M}(\bar{\tau}_n)$.

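Lemma 3.6. Almost surely under assumptions (A2)(b) and (A4),

  $\lim_{n \to \infty} \sup_k \left\{ \left\| \sum_{i=n}^{k-1} \bar{\alpha}_{i+1} (M_{i+1} - \bar{M}_{i+1}) f_i \right\| ;\ k = n+1, \ldots, \bar{m}(\bar{\tau}_n + T) \right\} = 0.$

Proof. See Appendix A.4. The proof relies on (3.11). □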



3.4. Proof of Theorem 3.1. We must verify that the four conditions of
Theorem 2.3 hold for the stochastic approximation process in (3.6) to
ascertain that $x(t)$ is an asymptotic pseudo-trajectory of (3.8).

Fix $T > 0$. Then,

  $\sup_k \left\{ \left\| \sum_{i=n}^{k-1} \bar{\alpha}_{i+1} \bar{V}_{i+1} \right\| ;\ k = n+1, \ldots, \bar{m}(\bar{\tau}_n + T) \right\}$

(3.12)  $\le \sup_k \left\{ \left\| \sum_{i=n}^{k-1} \bar{\alpha}_{i+1} M_{i+1} V_{i+1} \right\| ;\ k = n+1, \ldots, \bar{m}(\bar{\tau}_n + T) \right\}$

(3.13)  $+ \sup_k \left\{ \left\| \sum_{i=n}^{k-1} \bar{\alpha}_{i+1} (M_{i+1} - \bar{M}_{i+1}) f_i \right\| ;\ k = n+1, \ldots, \bar{m}(\bar{\tau}_n + T) \right\}.$

Using Lemma 3.3 and Lemma 3.6 immediately gives that (3.12) and (3.13)
converge to zero a.s., and hence this verifies that property (i) holds.
Assumption (A1)(a) directly gives that (ii) holds. Lastly, it is straightforward
to verify that, under (A1)-(A5), $\bar{F}(\cdot)$ is a stochastic approximation map
which verifies condition (iii), and (A1)(b) is equivalent to (iv). □

4. Two-timescale Asynchronous Stochastic Approximation. A
useful extension of standard stochastic approximations is to two-timescales.
This concept was originally introduced by Borkar [7] and has since been used
by Leslie and Collins [18] for multiple timescales and Konda and Borkar
[13] for two-timescales asynchronous stochastic approximation. If we have
a coupled pair of stochastic approximations where one system can be seen
to update more aggressively than the other then the aggressive process can be
viewed as always fully adjusted to the value of the other process. This is all controlled
through the user's choice of step sizes in the stochastic approximation. The
main result in this section is Corollary 4.8, which comes from combining
Theorem 3.1 with the previous work of Konda and Borkar [13].

4.1. Notation. In what is to follow we consider the extension of Theorem
3.1 to the two-timescales setting, with updates $\{x_n\}_{n \in \mathbb{N}}$ and $\{y_n\}_{n \in \mathbb{N}}$ on
different timescales. Let $I$ be the set of individual elements of the $x$ process
as in Section 3, and define $J$ similarly for the $y$ process. Let $K = |I|$ and
$L = |J|$ so that for all $n$, $x_n \in \mathbb{R}^K$ and $y_n \in \mathbb{R}^L$. As in Section 3 let $\bar{I} \subset 2^I$
be the set containing all combinations of elements in $I$ which have a positive
probability of being part of the asynchronous update, and define $\bar{J} \subset 2^J$ in
the same manner for the $y$ process. At iteration $n$ let $I_n \in \bar{I}$ and $J_n \in \bar{J}$ be
the updated components of each timescale respectively. Let each component
of the two processes have a counter for the number of times it has been
selected to be updated, defined by,

  $\nu_n(i) := \sum_{k=1}^{n} 1_{\{i \in I_k\}}, \qquad \phi_n(j) := \sum_{k=1}^{n} 1_{\{j \in J_k\}}.$

Here $\nu_n(i)$ is as in Section 3 and $\phi_n(j)$ has an analogous definition for the
$\{y_n\}_{n \in \mathbb{N}}$ process. Let $\{V_n\}_{n \in \mathbb{N}}$, $\{U_n\}_{n \in \mathbb{N}}$ be martingale noise processes
defined on $\mathbb{R}^K$ and $\mathbb{R}^L$ respectively, and $\{d_n\}_{n \in \mathbb{N}}, \{e_n\}_{n \in \mathbb{N}} \to 0$ as $n \to \infty$
similarly defined on $\mathbb{R}^K$ and $\mathbb{R}^L$ respectively. Let $V_n(i), d_n(i) \in \mathbb{R}$ be
component $i$ of $V_n$ and $d_n$, and similarly let $U_n(j), e_n(j) \in \mathbb{R}$ be component $j$
of $U_n$ and $e_n$. As in the previous sections $\{\alpha(n)\}_{n \in \mathbb{N}}$, and now $\{\gamma(n)\}_{n \in \mathbb{N}}$,
are positive, decreasing sequences of step sizes. Similar restrictions to those
in (A2) will be placed on $\{\alpha(n)\}_{n \in \mathbb{N}}$, $\{\gamma(n)\}_{n \in \mathbb{N}}$ with an additional
requirement for the two-timescale arguments to be valid; this will be made precise
in Section 4.2. Finally, $F(\cdot, \cdot) : \mathbb{R}^K \times \mathbb{R}^L \to \mathbb{R}^K$ and $G(\cdot, \cdot) : \mathbb{R}^K \times \mathbb{R}^L \to \mathbb{R}^L$
are set-valued maps, where $F_i(x, y)$ is the $i$-th value of $F(x, y)$ and similarly
for $G_j(x, y)$. For all $i = 1, \ldots, K$ and $j = 1, \ldots, L$ consider the following
coupled process,

(4.1)  $x_{n+1}(i) - x_n(i) \in \alpha(\nu_{n+1}(i)) 1_{\{i \in I_{n+1}\}} \left[ F_i(x_n, y_n) + V_{n+1}(i) + d_{n+1}(i) \right],$

       $y_{n+1}(j) - y_n(j) \in \gamma(\phi_{n+1}(j)) 1_{\{j \in J_{n+1}\}} \left[ G_j(x_n, y_n) + U_{n+1}(j) + e_{n+1}(j) \right].$

Notice that the only change to the first process from Sections 2 and 3 is
that the mean field now depends on $y_n$ as well as $x_n$. It follows that the
asynchronous and relative step sizes retain the same form. Recall these
definitions and extend them for the $\{y_n\}_{n \in \mathbb{N}}$ process:

  $\bar{\alpha}_n := \max_{i \in I_n} \alpha(\nu_n(i)), \qquad \mu_n(i) := \frac{\alpha(\nu_n(i))}{\bar{\alpha}_n} 1_{\{i \in I_n\}},$

  $\bar{\gamma}_n := \max_{j \in J_n} \gamma(\phi_n(j)), \qquad \sigma_n(j) := \frac{\gamma(\phi_n(j))}{\bar{\gamma}_n} 1_{\{j \in J_n\}}.$

As in Section 2.2 let $M_n$ be the $K \times K$ diagonal matrix of the $\mu_n(i)$ terms
and similarly let $N_n$ be the $L \times L$ diagonal matrix of the $\sigma_n(j)$ terms. The
coupled stochastic process (4.1) can be written more concisely as,



(4.2) 



x n +i ~ 

Vn+l 



Vn 



a n+ iM n+1 

- 7n+l-/V n+ l 



V n +i + d n+ \ 
U n +\ + e n+ i 



€ a n+1 M n+1 ■ F(x 

n ) Vn ) ; 

€ 7 n+ iiV n+1 • G(x 

n i Un ) • 
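The relationship between the per-component step sizes and the relative step sizes can be illustrated with a minimal sketch. The helper name `relative_steps` and the step-size rule $a(n) = 1/n$ used in the example are assumptions made for illustration; they are not taken from the source. The function returns $\bar a_n$ and the diagonal of $M_n$ for a given update set, so that $\bar a_n \mu_n(i)$ recovers $a(\nu_n(i))$ on the updated components and is zero elsewhere.

```python
def relative_steps(step, counts, updated):
    """Return (a_bar, mu) for one iteration of an asynchronous update.

    step    -- step-size function n -> a(n) (assumed positive and decreasing)
    counts  -- counts[i] plays the role of nu_n(i), the update counter of i
    updated -- set of component indices updated at this iteration (I_n)
    """
    # a_bar is the largest step size among the updated components.
    a_bar = max(step(counts[i]) for i in updated)
    # mu[i] is the step of component i relative to a_bar, zero if not updated.
    mu = [step(counts[i]) / a_bar if i in updated else 0.0
          for i in range(len(counts))]
    return a_bar, mu


if __name__ == "__main__":
    # Components 0 and 1 are updated; component 2 is left untouched.
    a_bar, mu = relative_steps(lambda n: 1.0 / n, [4, 2, 7], {0, 1})
    print(a_bar, mu)
```

Every entry of `mu` lies in $[0, 1]$ and at least one updated component has relative step size exactly 1, which is why the matrices $M_n$ and $N_n$ can be treated as bounded diagonal weights.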



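Finally, define the two timescales; let $\bar\tau_0 := 0$, $\bar\tau_k := \sum_{i=1}^{k} \bar a_i$, and $\bar\rho_0 := 0$, $\bar\rho_k := \sum_{i=1}^{k} \bar\gamma_i$. The division of time on the 'slow' timescale is given by the increments $\{\bar\tau_n\}_{n\in\mathbb{N}}$ and similarly for $\{\bar\rho_n\}_{n\in\mathbb{N}}$ on the 'fast' timescale. In a similar manner to the previous sections let $\bar m_a(t) := \sup\{k \ge 0;\ t \ge \bar\tau_k\}$, and $\bar m_\gamma(t) := \sup\{k \ge 0;\ t \ge \bar\rho_k\}$.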
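The two time divisions and the associated index maps can be computed directly from a finite list of step sizes. The sketch below is illustrative only: the function names `timescale` and `m_bar` are hypothetical, and the step sizes in the usage note are assumed rather than taken from the source.

```python
import bisect
import itertools


def timescale(steps):
    """Return [t_0, t_1, ...] with t_0 = 0 and t_k the sum of the first k steps."""
    return [0.0] + list(itertools.accumulate(steps))


def m_bar(times, t):
    """Return sup{k >= 0 : t >= times[k]}.

    Assumes times is nondecreasing with times[0] = 0 and that t >= 0.
    """
    # bisect_right counts the entries that are <= t; subtract one for the index.
    return bisect.bisect_right(times, t) - 1


if __name__ == "__main__":
    slow = timescale([1.0 / n for n in range(1, 51)])    # smaller steps
    fast = timescale([n ** -0.5 for n in range(1, 51)])  # larger steps
    print(m_bar(slow, 2.0), m_bar(fast, 2.0))
```

Because the 'slow' steps are smaller, more iterations fit into a window of fixed length on the 'slow' clock than on the 'fast' clock, which is the comparison the two-timescale arguments rely on.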

4.2. Assumptions. We state the assumptions (B1)-(B6) used for the convergence results of the two-timescale algorithm (4.2). These are exactly analogous to (A1)-(A5) and are simply extended to accommodate the two-timescales framework. The exceptions to this are (B2)(c) and (B6) and the slight adaptations to (B3), which are in line with those used by Borkar [7] and Konda and Borkar [13]. In (B4) we have produced a single combined Markov chain instead of one for each of the $\{x_n\}_{n\in\mathbb{N}}$ and $\{y_n\}_{n\in\mathbb{N}}$ processes to present a clearer assumption.

(B1) (a) For compact sets $C \subset \mathbb{R}^K$, $D \subset \mathbb{R}^L$, $x_n \in C$, $y_n \in D$ for all $n$.
(b) $\{d_n\}_{n\in\mathbb{N}}$ and $\{e_n\}_{n\in\mathbb{N}}$ are bounded sequences such that $d_n, e_n \to 0$ as $n \to \infty$.

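(B2) The following must be true for $\tilde a(n) = a(n)$ and $\tilde a(n) = \gamma(n)$:

(a) $\sum_n \tilde a(n) = \infty$ and $\tilde a(n) \to 0$ as $n \to \infty$,

(b) For $x \in (0,1)$, $\sup_n \tilde a([xn])/\tilde a(n) < A_x < \infty$. In addition, for all $n$, $\tilde a(n) \ge \tilde a(n+1)$,

(c) $a(n)/\gamma(n) \to 0$ as $n \to \infty$.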

(B3) (a) For all $z \in \{(x,y);\ x \in C, y \in D\}$, $F(z)$ is upper semi-continuous and for all $y \in D$, $F(\cdot, y) : \mathbb{R}^K \to \mathbb{R}^K$ is a stochastic approximation map.

(b) For all $z \in \{(x,y);\ x \in C, y \in D\}$, $G(z)$ is a stochastic approximation map.

The first and second assumptions are direct extensions of (A1) and (A2) to two timescales, with the addition of (B2)(c), which is a standard two-timescale assumption used by Borkar [7]. Condition (B3)(a) is similar to (A3) for the 'slow' timescale; however, it must hold for all values of the 'fast' timescale. (B3)(b) is a similar condition for the 'fast' timescale.

Define $\mathcal{H} \subset \mathcal{I} \times \mathcal{J}$ such that if $I \in \mathcal{I}$ and $J \in \mathcal{J}$ then $(I, J) \in \mathcal{H}$ if and only if $I$ and $J$ have a positive probability of occurring simultaneously (at the same iteration). This means that $\mathcal{H}$ is the combination of elements across $\mathcal{I} \times \mathcal{J}$ which have positive probability of being updated at any particular iteration. At iteration $n$, $H_n \in \mathcal{H}$ is taken to be the updated components across $I$ and $J$. In addition, let $z_n = (x_n, y_n) \in C \times D$ and $\mathcal{F}_n$ be a sigma algebra containing all the information up to and including the $n$th iteration. That is,

\[ \mathcal{F}_n = \sigma\left( \{H_m\}_m, \{z_m\}_m, \{\nu_m(i)\}_{i,m}, \{\phi_m(j)\}_{j,m};\ \forall m \le n,\ i = 1,\dots,K,\ j = 1,\dots,L \right). \]
(B4) (a) For all $z \in C \times D$ and $H_n, H_{n+1} \in \mathcal{H}$, [...]. Let [...]

(b) [...]

(c)
the map $z \mapsto Q^{(H_n, H_{n+1})}(z)$ is Lipschitz continuous.

(B5) Separately for both $(\tilde a(n), W_n, Z_n) = (a(n), V_n, I_n)$ and $(\tilde a(n), W_n, Z_n) = (\gamma(n), U_n, J_n)$ one of the following assumptions is satisfied:

(a) For some $q \ge 2$,
\[ \sum_n \tilde a(n)^{1+q/2} < \infty, \quad \text{and} \quad \sup_n \mathbb{E}\left( \|W_n\|^q \right) < \infty; \]

(b) With $W_n(i)$ independent of $\mathbb{1}_{\{i \in Z_n\}}$ given $\mathcal{F}_{n-1}$, $W_n(i)$ independent of $W_n(j)$ for $i \ne j$ and $\langle a, b \rangle = \sum_k a_k b_k$. There exists a positive $\Gamma$ such that for all $\theta \in \mathbb{R}^K$ or $\mathbb{R}^L$ (depending on the dimension of $W_n$),
\[ \mathbb{E}\left[ \exp\{\langle \theta, W_{n+1} \rangle\} \mid \mathcal{F}_n \right] \le \exp\left\{ \tfrac{\Gamma}{2} \|\theta\|^2 \right\}; \]
and
\[ \sum_n e^{-c/\tilde a(n)} < \infty, \quad \text{for each } c > 0. \]

If (B5)(a) holds for $(a(n), V_n, I_n)$ and (B5)(b) holds for $(\gamma(n), U_n, J_n)$ (or vice versa) then (B5) holds.

(B4) and (B5) are straightforward extensions to (A4) and (A5) where in (B4) we have chosen to create a combined Markov chain over both $I$ and $J$ to present a clearer assumption. As a result of (B4), Lemma 3.4 gives that every element of $\mathcal{H}$ (and hence every element of $I$ and $J$) is updated some minimum proportion of the time in the limit (see Section 3.3). Let $\varepsilon > 0$ be this minimum proportion. Define $\bar F(x,y) := \Omega^{\varepsilon}_{K} \cdot F(x,y)$ and $\bar G(x,y) := \Omega^{\varepsilon}_{L} \cdot G(x,y)$ analogously to the definition of $\bar F(\cdot)$ in (3.5), where $\Omega^{\varepsilon}_{k}$ is defined in (3.4).

(B6) For all $x \in C$ the differential inclusion,
\[ \frac{dy}{dt} \in \bar G(x, y), \]
has a unique globally asymptotically stable equilibrium, $\Lambda(x)$, where $\Lambda(\cdot) : \mathbb{R}^K \to \mathbb{R}^L$ is bounded, continuous and single-valued for $x \in C$.

The final assumption, (B6), is the asynchronous equivalent of the 'fast' timescale convergence criterion used by Borkar [7], and we use it throughout this section. However, at the end of Section 4.4 we provide an alternative assumption, (B6'), which allows the 'fast' timescale to converge instead to a globally attracting set, as opposed to a continuous single-valued function.

4.3. Convergence of the 'Fast' Timescale. Many of the proofs in this 
section use arguments which are identical to the corresponding results of 
Section 3; where this is the case we do not go into detail and instead direct
the reader to the appropriate result(s) to identify the method used.

Firstly, we require an additional lemma used by Konda and Borkar [13] 
which shows that the key two-timescales arguments made by Borkar [7] will 
continue to hold in the asynchronous case. 

Lemma 4.1. Under (B2) and (B4), $\bar a_n, \bar\gamma_n \to 0$ and $\bar a_n / \bar\gamma_n \to 0$ almost surely.

Proof. The proof of this result is identical to [13, Lemma 4.6]. The requisite assumptions are encapsulated in (B2) and the result of Lemma A.1, in Appendix A.1. □
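As a worked illustration of deterministic step sizes with the required separation of timescales, consider the polynomial choice $a(n) = 1/n$ and $\gamma(n) = n^{-2/3}$. This choice is an assumed example and is not taken from the source; it only checks the conditions on the deterministic sequences, not the almost sure statement about $\bar a_n$ and $\bar\gamma_n$ in the lemma.

```latex
% Assumed example (not from the source): a(n) = 1/n, gamma(n) = n^{-2/3}.
\begin{align*}
\sum_{n} a(n) &= \sum_{n} \frac{1}{n} = \infty, &
\sum_{n} \gamma(n) &= \sum_{n} n^{-2/3} = \infty, \\
a(n) &\to 0, &
\gamma(n) &\to 0, \\
\frac{a(n)}{\gamma(n)} &= n^{-1/3} \to 0, \\
\sum_{n} a(n)^{1+q/2} \Big|_{q=2} &= \sum_{n} n^{-2} < \infty, &
\sum_{n} \gamma(n)^{1+q/2} \Big|_{q=2} &= \sum_{n} n^{-4/3} < \infty.
\end{align*}
```

Both sequences are positive and decreasing, the ratio $a(n)/\gamma(n)$ vanishes, and the summability condition of (B5)(a) holds with $q = 2$ for each sequence.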

Again, we follow the method of Borkar [10] when examining the set-valued updates. Define $f_n$ and $g_n$ in the following manner; for $i = 1,\dots,K$, $j = 1,\dots,L$ let

\[ f_n(i)\,\mu_{n+1}(i) = \frac{x_{n+1}(i) - x_n(i)}{\bar a_{n+1}} - \mu_{n+1}(i) \left[ V_{n+1}(i) + d_{n+1}(i) \right], \]

\[ g_n(j)\,\sigma_{n+1}(j) = \frac{y_{n+1}(j) - y_n(j)}{\bar\gamma_{n+1}} - \sigma_{n+1}(j) \left[ U_{n+1}(j) + e_{n+1}(j) \right]. \]

As in Section 3.1, if $\mu_{n+1}(i) = 0$ then we can select any $f_n(i) \in F_i(x_n, y_n)$ and similarly if $\sigma_{n+1}(j) = 0$ then we can select any $g_n(j) \in G_j(x_n, y_n)$. $f_n$ and $g_n$ represent the realised values of $F(x_n, y_n)$ and $G(x_n, y_n)$ respectively. We express (4.2) as,

\[
\begin{aligned}
x_{n+1} &= x_n + \bar a_{n+1} M_{n+1} \left[ f_n + V_{n+1} + d_{n+1} \right], \\
y_{n+1} &= y_n + \bar\gamma_{n+1} N_{n+1} \left[ g_n + U_{n+1} + e_{n+1} \right].
\end{aligned}
\tag{4.3}
\]
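Let $0_K$ be a zero vector of length $K$, and define

\[
z_n := \begin{pmatrix} x_n \\ y_n \end{pmatrix}, \quad
\Gamma_n := \begin{pmatrix} M_n & 0 \\ 0 & N_n \end{pmatrix}, \quad
\zeta_n := \begin{pmatrix} \frac{\bar a_n}{\bar\gamma_n} V_n \\ U_n \end{pmatrix}, \quad
\epsilon_{n+1} := \begin{pmatrix} \frac{\bar a_{n+1}}{\bar\gamma_{n+1}} \left[ f_n + d_{n+1} \right] \\ e_{n+1} \end{pmatrix}, \quad
\Psi(z_n) := \begin{pmatrix} 0_K \\ G(z_n) \end{pmatrix}.
\]

Then we can express the coupled process in (4.3) as a single iterative process,

\[ z_{n+1} - z_n - \bar\gamma_{n+1} \Gamma_{n+1} \left[ \zeta_{n+1} + \epsilon_{n+1} \right] \in \bar\gamma_{n+1} \Gamma_{n+1} \cdot \Psi(z_n). \tag{4.4} \]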

This is in the same form as equation (2.4), which is examined throughout Section 3. The limiting behaviour of (4.4) can therefore be studied using the same method as in Section 3. Although assumptions (A1)-(A4) are embedded in (B1)-(B6), we avoid the need for (A5) under (B1)-(B6) by immediately proving the result corresponding to Lemma 3.3 for (4.4).

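Lemma 4.2. Under (B2)(b), (B4) and (B5), with probability 1, for all $T > 0$,

\[ \lim_{n\to\infty} \sup \left\{ \left\| \sum_{i=n}^{k-1} \bar\gamma_{i+1} \Gamma_{i+1} \zeta_{i+1} \right\| ;\ k = n+1, \dots, \bar m_\gamma(\bar\rho_n + T) \right\} = 0. \]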

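Proof. Firstly, let $\ell^{a}_n(T) := \sup\{k \ge 0;\ \bar\tau_n + T \ge \bar\tau_{n+k}\}$ and $\ell^{\gamma}_n(T) := \sup\{k \ge 0;\ \bar\rho_n + T \ge \bar\rho_{n+k}\}$. By Lemma 4.1, $\bar\gamma_n > \bar a_n$ in the limit and hence $\ell^{a}_n(T) \ge \ell^{\gamma}_n(T)$. From this, for limiting values of $n$,

\[ \bar m_a(\bar\tau_n + T) = n + \ell^{a}_n(T) \ge n + \ell^{\gamma}_n(T) = \bar m_\gamma(\bar\rho_n + T). \tag{4.5} \]

Now,

\[ \left\| \sum_{j=n}^{k-1} \bar\gamma_{j+1} \Gamma_{j+1} \zeta_{j+1} \right\| \le \left\| \sum_{j=n}^{k-1} \bar a_{j+1} M_{j+1} V_{j+1} \right\| + \left\| \sum_{j=n}^{k-1} \bar\gamma_{j+1} N_{j+1} U_{j+1} \right\|. \]

Using that the assumptions for Lemma 3.3 are contained in (B2)(b), (B4) and (B5), the first term converges to zero for $k = n+1, \dots, \bar m_a(\bar\tau_n + T)$ and, with the same arguments, the second term converges to zero for $k = n+1, \dots, \bar m_\gamma(\bar\rho_n + T)$. Combining this with (4.5) proves the result for $k = n+1, \dots, \bar m_\gamma(\bar\rho_n + T)$ as required. □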

Now we note that assumption (B4) allows us to use identical arguments to those in Lemma 3.4. Recall the definition of $\Omega^{\varepsilon}_{k}$ in (3.4). Then there exists an $\varepsilon > 0$ such that $\Gamma_n \in \Omega^{\varepsilon}_{K+L}$. Let $\bar z(t)$ be the linear interpolation of (4.4) in the same manner as the single timescale case in (3.7). Fix $\varepsilon > 0$ and let $\bar\Psi(\cdot) := \Omega^{\varepsilon}_{K+L} \cdot \Psi(\cdot)$, $\bar\Psi(\cdot) : \mathbb{R}^{K+L} \to \mathbb{R}^{K+L}$.

Lemma 4.3. Under assumptions (B1)-(B6), with probability 1, $\bar z(t)$ is an asymptotic pseudo-trajectory of the differential inclusion,

\[ \frac{dz}{dt} \in \bar\Psi(z). \tag{4.6} \]

Proof. We have shown (A1)-(A4) hold for (4.4) and this is in the form of (2.4). Combining this and Lemma 4.2 we can use Theorem 3.1 to give the result immediately. □

Let the linear interpolations of the two timescales in (4.3) be denoted by $\bar x(t)$ and $\bar y(t)$ respectively, analogously to the single timescale case in (3.7).

Corollary 4.4. Under assumptions (B1)-(B6), with probability 1, the interpolated process

\[ (\bar x(t), \bar y(t)) \to \{ (x, \Lambda(x));\ x \in C \} \]

as $t \to \infty$.



Proof. Immediate from Lemma 4.3 and (B6) using the same arguments 
as Borkar [7]. □ 

4.4. Convergence of the 'Slow' Timescale. Since we have a function $F(\cdot,\cdot)$ which depends on two variables, but we are treating one of these as fully calibrated to the other, we have a slightly different framework to that of Benaïm, Hofbauer and Sorin [4]. Therefore we present a slight variation on their perturbed solution [4, Definition (II)]. Despite this we are still able to show that this is an asymptotic pseudo-trajectory to the desired differential inclusion. Hence the same convergence results still apply. To reduce notation, define $F^{\Lambda}(\cdot) : \mathbb{R}^K \to \mathbb{R}^K$ as $F^{\Lambda}(x) := F(x, \Lambda(x))$. Note that under (B3)(a) and (B6) $F^{\Lambda}(\cdot)$ is a stochastic approximation map. Following our previous notation, let

\[ \bar F^{\Lambda}(x) := \Omega^{\varepsilon}_{K} \cdot F^{\Lambda}(x). \]

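Definition 4.5. A continuous function $z : \mathbb{R}_+ \to \mathbb{R}^K$ is a jointly perturbed solution to the differential inclusion,

\[ \frac{dx}{dt} \in \bar F^{\Lambda}(x), \tag{4.7} \]

if

(i) $z$ is absolutely continuous.

(ii) $\frac{dz(t)}{dt} - V(t) - d(t) \in \bar F^{\Lambda,\delta(t)}(z(t))$, for almost every $t > 0$, and for some bounded $d(t), \delta(t) \to 0$, where

\[ \bar F^{\Lambda,\delta}(x) = \left\{ f \in C;\ \|x' - x\| < \delta,\ y' \in D,\ \|y' - \Lambda(x')\| < \delta,\ \inf_{f' \in \bar F(x', y')} \|f' - f\| < \delta \right\}. \]

(iii) $t \mapsto V(t)$ is a locally integrable function such that for all $T > 0$,

\[ \lim_{t\to\infty} \sup_{0 \le v \le T} \left\| \int_{t}^{t+v} V(s)\, ds \right\| = 0. \]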



The key difference between a jointly perturbed solution and a perturbed solution is that in the original work of Benaïm, Hofbauer and Sorin [4] (and as in Section 3) the mean field depends on a single variable. In contrast, here the mean field depends on two variables. Hence in part (ii) we must allow for perturbations in both variables simultaneously instead of perturbing just the one.

Lemma 4.6. Under assumptions (B1)-(B6) a jointly perturbed solution 
of (4.7) is also an asymptotic pseudo-trajectory to the flow induced by (4.7). 

Proof. The proof is identical to the proof of [4, Theorem 4.2] which 
establishes that a perturbed solution is an asymptotic pseudo-trajectory. □ 

Define $\bar M(t)$ and $\bar M_n$ in an identical manner to the same terms in Section 3.4. Corollary 4.4 allows us to consider the updates on the 'slow' timescale of the coupled process in (4.3) given by,

\[ x_{n+1} - x_n - \bar a_{n+1} \left[ \tilde V_{n+1} + \tilde d_{n+1} \right] \in \bar a_{n+1} \bar F(x_n, y_n), \tag{4.8} \]

where, as in Section 3.4, $\tilde V_{n+1} = M_{n+1} V_{n+1} + [M_{n+1} - \bar M_{n+1}] f_n$ and $\tilde d_{n+1} = M_{n+1} d_{n+1}$. As with the single timescale framework let $\bar x(t)$ be a linear interpolation of (4.8) in the same manner as (3.7). That is,

\[ \bar x(\bar\tau_n + s) = x_n + s\, \frac{x_{n+1} - x_n}{\bar a_{n+1}}, \quad s \in [0, \bar a_{n+1}). \]

Theorem 4.7. Under assumptions (B1)-(B6), with probability 1, $\bar x(\cdot)$ is an asymptotic pseudo-trajectory to the differential inclusion (4.7).

Proof. We show that $\bar x(\cdot)$ is a jointly perturbed solution of (4.7) and hence by Lemma 4.6 it is an asymptotic pseudo-trajectory of (4.7).

The proof is almost identical to the proof of [4, Proposition 1.3]. The differences come with our choice of $\delta(t) = \|\bar x(t) - x_{\bar m(t)}\| + \|y_{\bar m(t)} - \Lambda(x_{\bar m(t)})\|$, the first term of which will converge to zero as in the original proof and the second of which converges to zero as $\|y_n - \Lambda(x_n)\| \to 0$ almost surely as a result of the convergence of the 'fast' timescale. Clearly from Definition 4.5 part (ii), $\bar F(x_{\bar m(t)}, y_{\bar m(t)}) \subset \bar F^{\Lambda,\delta(t)}(\bar x(t))$. The rest of the proof completes as in [4, Proposition 1.3]. □

Corollary 4.8. If there is a globally attracting set, A, for the differen- 
tial inclusion (4.7) and assumptions (B1)-(B6) are satisfied, then the two- 
timescale iterative process (4.2) will almost surely converge to A. 

Proof. Immediate by combining Corollary 4.4 and Theorem 4.7 with [4, 
Proposition 3.27]. □ 

Remark 1. It should be clear that the methods in this chapter can be applied to a standard, synchronous, stochastic approximation with set-valued mean fields. In this case (B4) is trivially satisfied, (B2)(b) can be removed and the sets $\Omega^{\varepsilon}_{K}$, $\Omega^{\varepsilon}_{L}$ can be replaced by a single $K \times K$ and $L \times L$ identity matrix respectively. This means that $\bar F(x,y) = F(x,y)$ and similarly $\bar G(x,y) = G(x,y)$.

Remark 2. This framework allows for the 'fast' timescale to converge to a set of limit points. Assumption (B6) can be replaced with the following

(B6') For all $x \in C$ the differential inclusion,
\[ \frac{dy}{dt} \in \bar G(x, y), \]
has a globally attracting set, $\Lambda(x)$, where $\Lambda(\cdot) : \mathbb{R}^K \to \mathbb{R}^L$ is an upper semi-continuous set-valued map, such that for all $x \in C$, $\Lambda(x)$ is compact, convex and non-empty. In addition, for all $x \in C$, $F(x, \Lambda(x))$ is a convex set-valued map, i.e., for all $\lambda, \lambda' \in \Lambda(x)$,

\[ \alpha F(x, \lambda) + (1 - \alpha) F(x, \lambda') \subset F(x, \alpha\lambda + (1 - \alpha)\lambda'), \quad \text{for all } \alpha \in [0, 1]. \]

Under (B6') $F^{\Lambda}(x)$ will still be a stochastic approximation map and, although the details of the arguments in Section 4.4 change slightly, the method remains almost identical.

5. An Application: Learning in a Markov Decision Process. In this section we provide an example where our two-timescale asynchronous stochastic approximation approach is needed. The algorithm is an actor-critic style algorithm based upon estimating rewards and playing an $\varepsilon$-greedy best response to these estimates. Similarly to Konda and Borkar [13] we make assumptions on the Markov decision process (MDP) and the coupled learning algorithm separately. These correspond to (B1)-(B6).

Firstly, we begin by outlining a suitable infinite horizon, discounted reward MDP described by the tuple $(\mathcal{S}, \mathcal{A}, P, r, \beta)$. $\mathcal{S}$ is the state space of the MDP, $\mathcal{A}$ is the set of actions of the decision-maker (agent), and $\mathcal{A}(s)$ is the set of actions available to the agent in state $s \in \mathcal{S}$. $P$ represents the form of the stochastic transitions and in what is to follow we take $P_{ss'}(a)$ to denote the probability of transitioning from state $s \in \mathcal{S}$ to $s' \in \mathcal{S}$ when the agent has selected action $a \in \mathcal{A}(s)$. The reward to the decision-maker for selecting action $a \in \mathcal{A}(s)$ is denoted by $r(s, a)$, and $\beta \in (0, 1)$ is the discount factor.

Let $s_n \in \mathcal{S}$ and $a_n \in \mathcal{A}(s_n)$ be the state and the action, respectively, selected by the decision-maker at iteration $n$. Assume that at every iteration the agent observes the state of the process and a noisy version of the reward received from the action they have chosen, denoted by $R_n$. If $\mathcal{F}_n := \{a_m, s_m, R_{m-1};\ m = 1, \dots, n\}$ then $\mathbb{E}[R_n \mid \mathcal{F}_n] = r(s_n, a_n)$, and we assume that $R_n$ has a finite variance. Let $K := |\mathcal{S}|$ and $\Delta(\mathcal{A}(s))$ represent the set of probability distributions over $\mathcal{A}(s)$. Then we denote the combination of $K$ probability distributions as $\Delta_K := \Delta(\mathcal{A}(1)) \times \dots \times \Delta(\mathcal{A}(K))$. A strategy for state $s \in \mathcal{S}$ is denoted by $\pi(s) \in \Delta(\mathcal{A}(s))$ and let $\pi := (\pi(1), \dots, \pi(K)) \in \Delta_K$ be a strategy over all states. $\pi(s, a)$ is defined as the probability that action $a$ is taken in state $s$. Players start with a strategy $\pi_0$; the MDP begins in a random state $s_1 \in \mathcal{S}$ and the decision-maker selects an action $a_1$ from $\pi_0$. The agent wishes to find a strategy, $\pi$, to maximise their expected discounted reward,

\[ \mathbb{E}\left[ \sum_{n=1}^{\infty} \beta^{n-1} r(s_n, a_n) \right]. \]

Define $V^{\pi}(s)$ for all $s \in \mathcal{S}$ in the following manner,

\[ V^{\pi}(s) = \sum_{a \in \mathcal{A}(s)} \pi(s, a) \left[ r(s, a) + \beta \sum_{s' \in \mathcal{S}} P_{ss'}(a) V^{\pi}(s') \right]. \tag{5.1} \]

For all $s \in \mathcal{S}$ there exists a maximum value of $V^{\pi}(s)$ over all $\pi \in \Delta_K$ [1]. Let $\tilde\pi \in \Delta_K$ be a strategy such that $V^{\tilde\pi}(s)$ is maximal in (5.1) for all $s \in \mathcal{S}$. $\tilde\pi$ is known as an optimal strategy for the MDP. Again we use a subscript $n$ on $\pi(s, a)$, $\pi(s)$ and $\pi$ to represent the strategy of the player at iteration $n$.

To reduce the complexity of the notation we assume that every state has the same number of actions, which is denoted by $A := |\mathcal{A}(s)|$ for any $s \in \mathcal{S}$, and we let $m := KA$. Note that having a different number of actions in each state does not affect the validity of this approach but does make the notation more cumbersome.

We begin by placing restrictions on the learning rates and the MDP. In this algorithm we select learning rates $\{a(n)\}_{n\in\mathbb{N}}$ and $\{\gamma(n)\}_{n\in\mathbb{N}}$ which satisfy (B2) and (B5)(a) and use two asynchronous counters,

\[ \nu_n(s) := \sum_{i=1}^{n} \mathbb{1}_{\{s_i = s\}}, \qquad \phi_n(s, a) := \sum_{i=1}^{n} \mathbb{1}_{\{(s_i, a_i) = (s, a)\}}. \]

In addition we assume that the set of transition probabilities, $\{P_{ss'}(a)\}_{s,s',a}$, form an aperiodic, irreducible, positive recurrent Markov chain. Moreover, at every iteration when the MDP is in the state $s$ we enforce that every action $a \in \mathcal{A}(s)$ is played with a non-zero minimum probability and that this holds for every $s \in \mathcal{S}$. Therefore for all $s \in \mathcal{S}$, $a \in \mathcal{A}(s)$ and $n > 0$, $\pi_n(s, a) \ge \varepsilon$ for some $\varepsilon > 0$.

Finally, before directly analysing the algorithm we present a method for verifying the global convergence of a standard differential inclusion in the form of (2.1).

Definition 5.1. Let $\Lambda \subset \mathbb{R}^K$. A continuous function $W : \mathbb{R}^K \to \mathbb{R}$ is a Lyapunov function for the differential inclusion (2.1) if it satisfies the following criteria.

(i) $W(y) < W(x)$ for all $x \in \mathbb{R}^K \setminus \Lambda$, $y \in \Phi_t(x)$, $t > 0$.

(ii) $W(y) \le W(x)$ for all $x \in \Lambda$, $y \in \Phi_t(x)$, $t \ge 0$.

Finding a Lyapunov function proves the global convergence of set-valued dynamical systems in the form of (2.1), a concept which is fully described by Benaïm, Hofbauer and Sorin [4].






Because we include the asynchronicity within the mean field, the differential inclusions associated with our algorithm will be in the form,

\[ \frac{dx}{dt} \in \Omega^{\varepsilon}_{k} \cdot [h(x) - x], \tag{5.2} \]

where $h(\cdot)$ is set-valued. Let $h'(x) := \Omega^{\varepsilon}_{k} \cdot [h(x) - x]$. We state slight modifications of a result of Konda and Borkar [13] and a result of Benaïm, Hofbauer and Sorin [5] to allow us to easily prove the convergence of differential inclusions in the form of (5.2). We do not state proofs for either of these as they are straightforward extensions of [13, Lemma 5.4] and [5, Theorem 3.10]. Let

\[ K_{s,a}(\pi) := r(s, a) + \beta \sum_{s' \in \mathcal{S}} P_{ss'}(a) V^{\pi}(s') - V^{\pi}(s), \]

for all $s \in \mathcal{S}$, $a \in \mathcal{A}(s)$, and let $K_s(\pi)$ be the $A$-vector of these terms for all $a \in \mathcal{A}(s)$. In addition, let $\nabla_{\pi} V^{\pi}(s)$ denote taking the partial derivative of $V^{\pi}(s)$ with respect to $\pi$.

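Lemma 5.2. Let $G$ be a vector field on $\Delta_K$ conditional on $\pi$. If

\[ \langle G_s(\pi), K_s(\pi) \rangle \ge 0, \]

then

\[ \langle G(\pi), \nabla_{\pi} V^{\pi}(s) \rangle \ge \langle G_s(\pi), K_s(\pi) \rangle \ge 0. \]

Theorem 5.3. Assume that $h'(x)$ is a stochastic approximation map and that there exists a positive definite function $W \in C^1(\mathbb{R}^k, \mathbb{R})$ such that if $\Lambda = \{W(x) = 0;\ x \in C\}$ and $x \in C \setminus \Lambda$ then for any $\omega \in \Omega^{\varepsilon}_{k}$, $x' \in h(x)$,

\[ \langle \nabla_x W(x), \omega(x' - x) \rangle < 0. \]

Then $W(\cdot)$ is a Lyapunov function for (5.2) with attracting set $\Lambda$.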

Using Lemma 5.2 and Theorem 5.3 we show the convergence of the fol- 
lowing algorithm. 

5.1. The Algorithm. This algorithm cannot be studied in the framework of Konda and Borkar [13] due to the Lipschitz continuity restriction they place on the mean fields of the coupled stochastic approximations. In this work we have relaxed this condition, allowing the study of processes which are based on the best response. Firstly, if $\{Q(s,a)\}_{s,a}$ is the set of action values for an MDP, let $b_s(Q) := \arg\max_{a \in \mathcal{A}(s)} \{Q(s, a)\}$ be the best response set to $\{Q(s, a)\}_{s,a}$ for state $s \in \mathcal{S}$. Use the following coupled algorithm to estimate action values and the optimal strategy for all $s \in \mathcal{S}$ and $a \in \mathcal{A}(s)$,

\[ Q_{n+1}(s, a) = Q_n(s, a) + \gamma(\phi_{n+1}(s, a))\, \mathbb{1}_{\{(s,a) = (s_{n+1}, a_{n+1})\}} \left[ R_{n+1} + \beta U_n(s_{n+2}) - Q_n(s, a) \right], \tag{5.3} \]

\[ \pi_{n+1}(s) \in \pi_n(s) + a(\nu_{n+1}(s))\, \mathbb{1}_{\{s = s_{n+1}\}} \left[ b_s(Q_n) - \pi_n(s) \right], \tag{5.4} \]

where $U_n(s) = \sum_{a \in \mathcal{A}(s)} \pi_n(s, a) Q_n(s, a)$. The action $a_n$ is selected using an $\varepsilon$-greedy version of the strategy $\pi_n$. For all $s \in \mathcal{S}$ and $a \in \mathcal{A}(s)$ let

\[ \pi^{\varepsilon}_n(s, a) := \pi_n(s, a)(1 - A\varepsilon) + \varepsilon. \]

Then $P(a_{n+1} = a \mid s_{n+1} = s) = \pi^{\varepsilon}_n(s, a)$. We must verify that (B1)-(B6) hold for this algorithm. We do not directly verify that (B1) holds for this algorithm, but as pointed out in Section 3.2 methods to do so are discussed elsewhere. Furthermore the choice of learning parameters verifies (B2) and the choice of mean field in (5.3) and (5.4) immediately gives that (B3) and (B5) hold.
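The coupled updates (5.3) and (5.4) can be sketched in a few lines for a finite MDP. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the reward is observed without noise, the step sizes $a(n) = 1/n$ and $\gamma(n) = n^{-2/3}$ are an assumed choice, and a single element of the best response set (uniform over maximising actions) is used in place of the set itself.

```python
import random


def eps_greedy(pi_s, eps):
    """Mixture pi(s, a) * (1 - A * eps) + eps used to select actions."""
    n_actions = len(pi_s)
    return [p * (1.0 - n_actions * eps) + eps for p in pi_s]


def best_response(q_s):
    """One element of the best response set: uniform over maximising actions."""
    top = max(q_s)
    winners = [a for a, q in enumerate(q_s) if q == top]
    return [1.0 / len(winners) if a in winners else 0.0
            for a in range(len(q_s))]


def learn(P, r, beta, eps, n_iter, seed=0):
    """Run the coupled critic and actor updates on a finite MDP.

    P[s][a][t] -- transition probability from state s to t under action a
    r[s][a]    -- reward, observed without noise here (an assumption)
    """
    rng = random.Random(seed)
    n_states, n_actions = len(r), len(r[0])
    Q = [[0.0] * n_actions for _ in range(n_states)]
    pi = [[1.0 / n_actions] * n_actions for _ in range(n_states)]
    nu = [0] * n_states                               # visits to each state
    phi = [[0] * n_actions for _ in range(n_states)]  # visits to each pair

    def slow(n):  # a(n), actor step size (assumed choice)
        return 1.0 / n

    def fast(n):  # gamma(n), critic step size (assumed choice)
        return n ** (-2.0 / 3.0)

    s = rng.randrange(n_states)
    a = rng.choices(range(n_actions), weights=eps_greedy(pi[s], eps))[0]
    for _ in range(n_iter):
        s_next = rng.choices(range(n_states), weights=P[s][a])[0]
        # Quantities built from the current iterates (Q_n, pi_n).
        u_next = sum(p * q for p, q in zip(pi[s_next], Q[s_next]))
        target = best_response(Q[s])
        # Critic update (5.3), applied to the visited pair only.
        phi[s][a] += 1
        Q[s][a] += fast(phi[s][a]) * (r[s][a] + beta * u_next - Q[s][a])
        # Actor update (5.4), applied to the visited state only.
        nu[s] += 1
        step = slow(nu[s])
        pi[s] = [p + step * (b - p) for p, b in zip(pi[s], target)]
        # Move on and pick the next action epsilon-greedily.
        s = s_next
        a = rng.choices(range(n_actions), weights=eps_greedy(pi[s], eps))[0]
    return Q, pi
```

Only the visited state-action pair and the visited state change at each iteration, which is the asynchronicity captured by the counters $\phi_n(s,a)$ and $\nu_n(s)$. Each row of `pi` stays a probability vector because the actor step is a convex combination of the current strategy and a best response.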

For this algorithm we have $J = \{(s, a);\ s \in \mathcal{S}, a \in \mathcal{A}(s)\}$ and $I = \{s;\ s \in \mathcal{S}\}$. This gives that $\mathcal{H} = \{((s, a), s);\ s \in \mathcal{S}, a \in \mathcal{A}(s)\}$ and for simplicity we write,

\[ \mathcal{H} = \{(s, a);\ s \in \mathcal{S}, a \in \mathcal{A}(s)\}. \]

Following the notation of Section 4 we have that $z_n = (Q_n, \pi_n)$. With $H_n, H_{n+1} \in \mathcal{H}$ such that $H_n = (s, a)$, $H_{n+1} = (s', a')$, then $Q^{(H_n, H_{n+1})}(z) = \pi^{\varepsilon}_n(s', a') P_{ss'}(a)$. This shows that (B4)(a) is satisfied. Again using the notation of Section 4 the set of transition probabilities are denoted,

By assumption on $\{P_{ss'}(a)\}_{s,s',a}$ we have that (B4)(b) is satisfied. Since $\pi^{\varepsilon}_n(s', a') P_{ss'}(a)$ is a continuous function of $\pi^{\varepsilon}_n$, which similarly is a continuous function of $\pi_n \in z_n$, (B4)(c) is satisfied.

A consequence of (B4) from Appendix A.1 is that in the limit every state of the MDP is visited a minimum proportion of the time, $\eta$, for some $\eta > 0$. Similarly, by placing the restriction that every action is selected with at least probability $\varepsilon$ for some $\varepsilon > 0$, every state, action pair is taken a minimum proportion of the time, $\eta'$, for some $\eta' > 0$. Using the approach of Section 4 we do not explicitly need to know the values $\eta$ and $\eta'$ as we verify convergence for every $\eta, \eta' > 0$.

Finally, we need to verify (B6). Define

\[ Q^{\pi}(s, a) := r(s, a) + \beta \sum_{s' \in \mathcal{S}} P_{ss'}(a) V^{\pi}(s'), \]

and let $Q^{\pi}(s)$ be the $A$-vector containing $Q^{\pi}(s, a)$ for all $a \in \mathcal{A}(s)$. Let $h(\cdot,\cdot) : \Delta_K \times \mathbb{R}^m \to \mathbb{R}^m$ be defined such that,

\[ h_{s,a}(\pi, Q) := r(s, a) + \beta \sum_{s' \in \mathcal{S}} P_{ss'}(a) V(s'), \]

where $V(s) = \sum_{a \in \mathcal{A}(s)} \pi(s, a) Q(s, a)$. Let $h_s(\pi, Q)$ be the $A$-vector of the $h_{s,a}(\pi, Q)$ terms, which means that $h(\pi, Q^{\pi}) = Q^{\pi}$. For fixed $\pi \in \Delta_K$ consider the differential inclusion

\[ \frac{dQ_s(t)}{dt} \in \Omega^{\eta'}_{A} \cdot \left[ h_s(\pi, Q_s(t)) - Q_s(t) \right] \quad \text{for all } s \in \mathcal{S}. \tag{5.5} \]

Lemma 5.4. $Q^{\pi}(s)$ is the unique asymptotically stable equilibrium to (5.5).

Proof. $h_s(\pi, Q_s(t))$ is a contraction mapping [22], [23]. Hence for any fixed $\omega \in \Omega^{\eta'}_{A}$, $Q_s(t) \to Q^{\pi}(s)$. Combining this with the note by Borkar [10, Chapter 7.4] proves the claim. □
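The contraction property behind this lemma can be illustrated in discrete time: iterating $Q \leftarrow h(\pi, Q)$ for a fixed strategy converges to the fixed point $Q^{\pi}$. The sketch below is an illustration of that fixed point only, under the assumption of a finite MDP stored as nested lists; the function names are hypothetical and the iteration is not the continuous-time dynamics of the differential inclusion itself.

```python
def h(pi, Q, P, r, beta):
    """The map h(pi, Q): entry [s][a] is r[s][a] + beta * sum_t P[s][a][t] * V[t],
    where V[t] = sum_a pi[t][a] * Q[t][a]."""
    n_states, n_actions = len(r), len(r[0])
    V = [sum(pi[s][a] * Q[s][a] for a in range(n_actions))
         for s in range(n_states)]
    return [[r[s][a] + beta * sum(P[s][a][t] * V[t] for t in range(n_states))
             for a in range(n_actions)]
            for s in range(n_states)]


def action_values(pi, P, r, beta, tol=1e-12, max_iter=10000):
    """Iterate Q <- h(pi, Q) from zero until successive iterates agree to tol."""
    Q = [[0.0] * len(r[0]) for _ in r]
    for _ in range(max_iter):
        Q_new = h(pi, Q, P, r, beta)
        gap = max(abs(x - y)
                  for row_new, row in zip(Q_new, Q)
                  for x, y in zip(row_new, row))
        Q = Q_new
        if gap < tol:
            break
    return Q
```

Since $\beta \in (0, 1)$, the gap between successive iterates shrinks by at least a factor $\beta$ per sweep in the maximum norm, so the loop terminates after a number of sweeps that grows only logarithmically in `1 / tol`.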

From this it follows that the values in $\{Q_n(s, a)\}_{s,a}$ converge to the true action values for the strategy $\pi$. These values are Lipschitz continuous in $\pi$, which ensures (B6) holds. Hence, Theorem 4.7 holds and the linear interpolation of the iterative process in (5.4) is an asymptotic pseudo-trajectory to the differential inclusion

\[ \dot\pi_s(t) \in \Omega^{\eta} \cdot \left[ b_s(Q^{\pi(t)}) - \pi_s(t) \right] \quad \text{for all } s \in \mathcal{S}, \tag{5.6} \]

for some $\eta > 0$. For a particular action $a \in \mathcal{A}(s)$, let $\pi_{s,a}(t)$ and $\dot\pi_{s,a}(t)$ represent the individual components of $\pi_s(t)$ and $\dot\pi_s(t)$ respectively, whilst $\pi(t)$ and $\dot\pi(t)$ are the $K \times A$ matrices containing all of the $\pi_{s,a}(t)$ and $\dot\pi_{s,a}(t)$ elements.

With assumptions (B1)-(B6) satisfied all that remains is to show that the differential inclusion (5.6) has a globally attracting set. Corollary 4.8 will then provide the convergence result of the coupled algorithm in (5.3) and (5.4). We note that (5.6) is in the form of (5.2) and hence we use Theorem 5.3 to prove the global convergence.

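Lemma 5.5. For each $s \in \mathcal{S}$ fix an $\omega_s \in \Omega^{\eta}$. Take any strategy $\pi \in \Delta_K$ and for each $s \in \mathcal{S}$ select $\hat\pi_s \in b_s(Q^{\pi})$. With

\[ \dot\pi_s := \omega_s \left[ \hat\pi_s - \pi_s \right], \quad \text{for all } s \in \mathcal{S}, \]

then for any $s \in \mathcal{S}$,

\[ \langle \nabla_{\pi} V^{\pi}(s), \dot\pi \rangle \ge 0. \]

Proof. Fix a strategy $\pi \in \Delta_K$ and for all $s \in \mathcal{S}$ fix $\omega_s \in \Omega^{\eta}$ and take $\hat\pi_s \in b_s(Q^{\pi})$. Let

\[ \dot\pi_s = \omega_s \left[ \hat\pi_s - \pi_s \right], \quad \text{for all } s \in \mathcal{S}. \]

Consider,

\[
\langle \dot\pi_s, K_s(\pi) \rangle = \omega_s \left[ \sum_{a \in \mathcal{A}(s)} \hat\pi_s(a) Q^{\pi}(s, a) - \sum_{a \in \mathcal{A}(s)} \pi_s(a) Q^{\pi}(s, a) \right] - \omega_s \left[ \sum_{a \in \mathcal{A}(s)} \hat\pi_s(a) V^{\pi}(s) - \sum_{a \in \mathcal{A}(s)} \pi_s(a) V^{\pi}(s) \right].
\]

The second term here is zero since $\sum_{a \in \mathcal{A}(s)} p(s, a) V^{\pi}(s) = V^{\pi}(s)$ for any $p(s) \in \Delta(\mathcal{A}(s))$. The first term is non-negative by the definition of the best response. Hence

\[ \langle \dot\pi_s, K_s(\pi) \rangle \ge 0. \]

Then using Lemma 5.2 gives the desired result. □

Now we use Theorem 5.3 and Lemma 5.5 to produce a Lyapunov function for (5.6) and hence prove the global convergence of the algorithm given by (5.3) and (5.4). Let $\Lambda := \{\pi;\ \pi \text{ an optimal strategy}\}$.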



Lemma 5.6. Fix $\tilde\pi$ as an optimal strategy. Then

\[ W(\pi) = \sum_{s \in \mathcal{S}} \left[ V^{\tilde\pi}(s) - V^{\pi}(s) \right] \]

is a Lyapunov function for the differential inclusion (5.6) and $\Lambda$ is a globally attracting set for (5.6).

Proof. We prove the claim by applying Theorem 5.3 to (5.6). Clearly $W(\cdot)$ is positive semi-definite since for all $s \in \mathcal{S}$, and any $\pi \in \Delta_K$, $V^{\tilde\pi}(s) \ge V^{\pi}(s)$, with equality for $\pi \in \Lambda$.

To prove the condition of Theorem 5.3 we note that for any fixed $t > 0$,

\[ \langle \nabla_{\pi(t)} W(\pi(t)), \dot\pi(t) \rangle = - \sum_{s \in \mathcal{S}} \langle \nabla_{\pi} V^{\pi}(s), \dot\pi \rangle, \]

for some strategy $\pi \in \Delta_K$, and for each $s \in \mathcal{S}$, a fixed $\omega_s \in \Omega^{\eta}$ and $\hat\pi_s \in b_s(Q^{\pi})$ such that

\[ \dot\pi_s = \omega_s \left[ \hat\pi_s - \pi_s \right], \quad \text{for all } s \in \mathcal{S}. \]

Using Lemma 5.5 immediately gives that $\langle \nabla_{\pi(t)} W(\pi(t)), \dot\pi(t) \rangle < 0$ for all $\pi(t) \in \Delta_K \setminus \Lambda$ and for any $t > 0$. Applying Theorem 5.3 proves the claim. □

Corollary 5.7. The coupled process $(Q_n, \pi_n)$ from (5.3) and (5.4) converges to the limit $(Q^{\tilde\pi}, \tilde\pi)$, where $\tilde\pi$ is an optimal strategy and $\{Q^{\tilde\pi}(s, a)\}_{s,a}$ is the set of associated action values.

Proof. Lemma 5.6 shows that a Lyapunov function exists for the differential inclusion (5.6); this with Corollary 4.8 proves the claim. □

6. Summary. We have combined the work of Benaïm, Hofbauer and Sorin [4] on differential inclusions with the work of Konda and Borkar [13] and Borkar [8] on asynchronous stochastic approximation in order to provide a framework for asynchronous stochastic approximation with a set-valued mean field. This enables us to modify the previous work on asynchronous stochastic approximation to use a set of assumptions which are straightforward to verify a priori.

Furthermore we extended the work of Konda and Borkar on asynchronous 
two-timescale stochastic approximation using this new framework. By allow- 
ing the mean fields to be updated using set-valued functions we provide a 
new result in two-timescale stochastic approximations which clearly applies 
to asynchronous and synchronous stochastic approximations. 

This approach provides a clear framework for single or multiple timescale asynchronous stochastic approximations, with assumptions which differ little from the synchronous case. Where previously the additional and difficult to verify assumptions required for asynchronous stochastic approximations could be perceived as a reason to avoid their use, this framework removes many of these issues.

In Section 5 we provided an example of a coupled learning algorithm 
for a discounted reward Markov decision process. We analysed the limiting 
behaviour using the results of Section 4. The algorithm uses a set-valued 
mean field based upon a best response style of actor-critic learning. We 
used the results of Section 4 to show convergence to an optimal strategy 
under a straightforward set of assumptions. This algorithm demonstrates 
the main value of the approach to asynchronous stochastic approximation 
in this paper. 



APPENDIX A: OMITTED PROOFS

A.1. Minimum Update Proportion. For the asynchronous stochastic approximations in (2.4) we are interested in understanding how different components of $x_n$ get selected to be updated. The previous work on this makes the direct assumption that in the limit all the elements of $I$ are updated in an equally spaced manner and for some minimum proportion of the iterations [13]; this assumption is difficult to verify prior to running the process. In this work we use results on Markov chains via assumption (A4) as an alternative which can be checked a priori if we know the transition probabilities of the state process.

Lemma A.1. Under (A4), there exists $\eta > 0$ such that $\forall i \in I$,

\[ \liminf_{n\to\infty} \frac{\nu_n(i)}{n} \ge \eta, \quad \text{a.s.} \]

Proof. The values of $I_n \in \mathcal{I}$ form a controlled Markov chain where $I_{n+1}$ depends on the current updated component, $I_n$, and the value of the iterative process, $x_n$. For $x \in C$ let $\pi_x(\cdot)$ be a stationary distribution for this Markov chain given by the transition probabilities $P_x(\cdot,\cdot)$; standard theory gives that, under (A4), $\pi_x(\cdot)$ exists, is unique, and for some $\delta_x > 0$, $\pi_x(I) \ge \delta_x$ for all $I \in \mathcal{I}$. Let $\eta = \min_{x \in C} \delta_x$, which exists and is positive since $C$ is compact. Then for all $I \in \mathcal{I}$, $x \in C$ we have that

\[ \pi_x(I) \ge \delta_x \ge \eta. \]

For $I \in \mathcal{I}$ define

\[ w_n(I) := \sum_{k=1}^{n} \mathbb{1}_{\{I_k = I\}}, \]

and let $\bar w_n(I) = w_n(I)/n$. Then

\[
\begin{aligned}
\bar w_n(I) &= \frac{n-1}{n}\, \bar w_{n-1}(I) + \frac{1}{n}\, \mathbb{1}_{\{I_n = I\}} \\
&= \bar w_{n-1}(I) + \frac{1}{n} \left( \mathbb{1}_{\{I_n = I\}} - \bar w_{n-1}(I) \right).
\end{aligned}
\tag{A.2}
\]

This is in the form of a stochastic approximation with controlled Markovian noise as in [9].

Let $\bar w(t)$ be the linear interpolation of the $\{\bar w_n\}_{n\in\mathbb{N}}$ process. Using [9, Corollary 3.1] the limiting behaviour of the interpolated process, $\bar w(t)$, will be an asymptotic pseudo-trajectory to a differential equation,

\[ \frac{d\bar w(t)}{dt} = \pi_{x(t)} - \bar w(t), \tag{A.3} \]

for a suitable process $x(t) \in C$ based upon $\{x_n\}_{n\in\mathbb{N}}$. We know that $\pi_x(I) \ge \eta > 0$ for all $n$ and hence the dynamics of $\bar w(t)$ can be expressed as, [...] where $\Omega^{\eta}$ is as defined in (3.4). This implies that any limit point of $\bar w(t)$ will be in $\Omega^{\eta}$ and hence $\liminf_{n\to\infty} \bar w_n(I) \ge \eta$ a.s. $\forall I \in \mathcal{I}$.

For $i \in I$, define $\mathcal{I}_i := \{I \in \mathcal{I};\ i \in I\}$; then $\nu_n(i) = \sum_{I \in \mathcal{I}_i} w_n(I)$, and so for some $I \in \mathcal{I}_i$,

\[ \liminf_{n\to\infty} \frac{\nu_n(i)}{n} \ge \liminf_{n\to\infty} \frac{w_n(I)}{n} \ge \eta, \quad \text{a.s.} \]
□
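The step from the running proportion to the stochastic approximation form (A.2) can be checked numerically. The following minimal sketch uses a hypothetical helper `running_frequency` (not from the source) that applies the recursion with step size $1/n$ to a sequence of 0/1 indicators and returns the running proportions.

```python
def running_frequency(indicators):
    """Apply w_n = w_{n-1} + (1/n) * (x_n - w_{n-1}) to a 0/1 sequence.

    Returns the list of running proportions, one per iteration.
    """
    w = 0.0
    history = []
    for n, x in enumerate(indicators, start=1):
        w = w + (x - w) / n  # the recursion in (A.2)
        history.append(w)
    return history


if __name__ == "__main__":
    # The component is selected at iterations 1, 3 and 4 out of 4.
    print(running_frequency([1, 0, 1, 1]))
```

The final value agrees with the direct average of the indicators, which is the sense in which the update proportion is itself a stochastic approximation driven by the selection process.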

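A.2. Noise Conditions for an Asynchronous Stochastic Approximation. Let $\tau_0 = 0$, $\tau_n = \sum_{k=1}^{n} a(k)$ and recall $\bar\tau_0 = 0$, $\bar\tau_n = \sum_{k=1}^{n} \bar a_k$. Denote the asynchronous noise term $\tilde V_n(i) = \frac{a(\nu_n(i))}{a(n)} \mathbb{1}_{\{i \in I_n\}} V_n(i)$, and let $\tilde V_n$ be the $K$-vector of these terms.

Lemma A.2. If $\{V_n\}_{n\in\mathbb{N}}$ is a martingale difference noise process and assuming (A2)(b) and (A4) hold;

(i) If $\sup_n \mathbb{E}\left[ \|V_n\|^q \right] < \infty$ for some $q \ge 2$ then almost surely,

\[ \sup_n \mathbb{E}\left[ \|\tilde V_n\|^q \right] < \infty. \]

(ii) Let $V_n(i)$ be independent of $I_n$ given $\mathcal{F}_{n-1}$ and $V_n(i)$ independent of $V_n(j)$ for $i \ne j$. Let $\langle a, b \rangle = \sum_k a_k b_k$. If there exists a $\Gamma > 0$ such that for all $\theta \in \mathbb{R}^K$,

\[ \mathbb{E}\left[ \exp\{\langle \theta, V_{n+1} \rangle\} \mid \mathcal{F}_n \right] \le \exp\left\{ \tfrac{\Gamma}{2} \|\theta\|^2 \right\}, \]

then almost surely there exists a $\tilde\Gamma > 0$ such that for all $\theta \in \mathbb{R}^K$,

\[ \mathbb{E}\left[ \exp\{\langle \theta, \tilde V_{n+1} \rangle\} \mid \mathcal{F}_n \right] \le \exp\left\{ \tfrac{\tilde\Gamma}{2} \|\theta\|^2 \right\}. \]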



Proof. (i) Combine rj from Lemma A.l with (A2)(b) to give A ri > 1. 
Then ||V^(i)|| < ||A^V^(i)|| using (A2)(b). From this it is immediate 
that 



E 



\VnUW 



\V n 



< oo. 



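(ii) Let $\tilde V_n(i) := 1_{\{i \in I_n\}} V_n(i)$. Clearly $\|\tilde V_n\| \le \|V_n\|$. In addition let $e_i$ be a $K$-dimensional basis vector with a 1 in the $i$th term and 0 everywhere else. From the assumption in Lemma A.2,

\[
E\left[\exp\{\langle\theta_i e_i, V_{n+1}\rangle\} \,\middle|\, \mathcal{F}_n\right] \le \exp\left\{\tfrac{\Gamma}{2}\theta_i^2\right\}.
\]

Using this gives the following,

\begin{align*}
E\left[\exp\{\langle\theta, \tilde V_{n+1}\rangle\} \,\middle|\, \mathcal{F}_n\right]
&= E\left[\exp\Big\{\Big\langle \sum_{i=1}^{K}\theta_i e_i,\ \tilde V_{n+1}\Big\rangle\Big\} \,\middle|\, \mathcal{F}_n\right] \\
&= \prod_{i=1}^{K} E\left[\exp\{\langle\theta_i e_i, \tilde V_{n+1}\rangle\} \,\middle|\, \mathcal{F}_n\right] \\
&\le \prod_{i=1}^{K} \exp\left\{\tfrac{\Gamma}{2}\theta_i^2\right\}
 = \exp\left\{\tfrac{\Gamma}{2}\|\theta\|^2\right\}.
\end{align*}

Notice that $a(\nu_{n+1}(i))\, 1_{\{i \in I_{n+1}\}} = a(\nu_n(i)+1)\, 1_{\{i \in I_{n+1}\}}$ and $a(\nu_n(i)+1)$ is deterministic given $\mathcal{F}_n$.

\begin{align*}
E\left[\exp\{\langle\theta, \bar V_{n+1}\rangle\} \,\middle|\, \mathcal{F}_n\right]
&= E\left[\exp\Big\{\Big\langle\theta,\ \sum_{i=1}^{K}\bar V_{n+1}(i)\, e_i\Big\rangle\Big\} \,\middle|\, \mathcal{F}_n\right] \\
&= E\left[\prod_{i=1}^{K} \exp\Big\{\theta_i\, \frac{a(\nu_n(i)+1)}{a(n+1)}\, \tilde V_{n+1}(i)\Big\} \,\middle|\, \mathcal{F}_n\right].
\end{align*}

Now using the independence of the $V(i)$ terms and letting $\theta_n(i) := \frac{a(\nu_n(i)+1)}{a(n+1)}\,\theta_i$,

\begin{align*}
E\left[\exp\{\langle\theta, \bar V_{n+1}\rangle\} \,\middle|\, \mathcal{F}_n\right]
&= \prod_{i=1}^{K} E\left[\exp\{\theta_n(i)\, \tilde V_{n+1}(i)\} \,\middle|\, \mathcal{F}_n\right] \\
&= \prod_{i=1}^{K} E\left[\exp\{\langle\theta_n(i)\, e_i,\ \tilde V_{n+1}\rangle\} \,\middle|\, \mathcal{F}_n\right].
\end{align*}

Note that $\|\theta_n(i)\|^2 \le \|A_\eta \theta_i\|^2$, with $A_\eta$ taken from (A2)(b). Finally, this gives,

\begin{align*}
E\left[\exp\{\langle\theta, \bar V_{n+1}\rangle\} \,\middle|\, \mathcal{F}_n\right]
&\le \prod_{i=1}^{K} \exp\left\{\tfrac{\Gamma}{2}\|\theta_n(i)\|^2\right\} \\
&\le \prod_{i=1}^{K} \exp\left\{\tfrac{\Gamma A_\eta^2}{2}\,\theta_i^2\right\}
 = \exp\left\{\tfrac{\Gamma A_\eta^2}{2}\|\theta\|^2\right\}.
\end{align*}

Letting $\bar\Gamma := \Gamma A_\eta^2$, which is constant, completes the proof of (ii). □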
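The scaling step in part (ii) can be checked numerically on a toy case. The sketch below is an illustration under stated assumptions only: the noise is a single Rademacher variable (not the noise model of this paper), for which the sub-Gaussian condition holds with $\Gamma = 1$, and `A` is a hypothetical stand-in for the constant $A_\eta$. The function names are introduced here purely for illustration.

```python
import math

def mgf_scaled_rademacher(theta: float, c: float) -> float:
    """E[exp(theta * c * V)] for V = +1 or -1 with probability 1/2 each."""
    return math.cosh(theta * c)

def subgaussian_bound(theta: float, gamma: float) -> float:
    """Right-hand side exp(gamma * theta^2 / 2) of the sub-Gaussian condition."""
    return math.exp(0.5 * gamma * theta * theta)

GAMMA = 1.0                  # Rademacher noise satisfies the condition with Gamma = 1
A = 3.0                      # hypothetical stand-in for A_eta from (A2)(b)
GAMMA_BAR = GAMMA * A ** 2   # candidate constant for the rescaled noise

def check(thetas, scales) -> bool:
    """True if every scale factor c in [0, A] respects the GAMMA_BAR bound."""
    return all(
        mgf_scaled_rademacher(th, c) <= subgaussian_bound(th, GAMMA_BAR)
        for th in thetas
        for c in scales
    )

thetas = [x / 10.0 for x in range(-50, 51)]
scales = [0.0, 0.5, 1.0, 2.0, A]
print(check(thetas, scales))
```

The check relies on the elementary inequality $\cosh(x) \le \exp(x^2/2)$, so any scale factor bounded by `A` inflates the sub-Gaussian constant by at most `A ** 2`.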



Proof (of Lemma 3.3).

Firstly, define $\ell_n(t) = \sup\{k \ge 0;\ t \ge \tau_{n+k} - \tau_n\}$ and let $\bar\ell_n(t) = \sup\{k \ge 0;\ t \ge \bar\tau_{n+k} - \bar\tau_n\}$, and notice $\ell_n(t) \ge \bar\ell_n(t)$ since $\bar\alpha_n \ge a(n)$ from (A2)(b). Therefore,

\[
m(\tau_n + T) = n + \ell_n(T) \;\ge\; n + \bar\ell_n(T) = \bar m(\bar\tau_n + T).
\]

Using this in the following,

\begin{align*}
&\sup\Big\{\Big\|\sum_{j=n}^{k-1} \bar\alpha_{j+1} M_{j+1} V_{j+1}\Big\|;\ k = n+1, \dots, \bar m(\bar\tau_n + T)\Big\} \\
&\qquad = \sup\Big\{\Big\|\sum_{j=n}^{k-1} a(j+1)\, \bar V_{j+1}\Big\|;\ k = n+1, \dots, \bar m(\bar\tau_n + T)\Big\} \\
&\qquad \le \sup\Big\{\Big\|\sum_{j=n}^{k-1} a(j+1)\, \bar V_{j+1}\Big\|;\ k = n+1, \dots, m(\tau_n + T)\Big\}.
\end{align*}

Combining Lemma A.2 with [4, Proposition 1.4], applied to the martingale noise sequence $\{\bar V_n\}_{n\in\mathbb{N}}$ and step sizes $\{a(n)\}_{n\in\mathbb{N}}$, gives that this latter term converges to zero as $n \to \infty$. □
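The comparison $m(\tau_n + T) \ge \bar m(\bar\tau_n + T)$ used above can be illustrated with a small deterministic sketch. It assumes $a(n) = 1/n$ and a hypothetical asynchronous step size $\bar\alpha_j = a(\lceil j/2 \rceil)$, which satisfies $\bar\alpha_j \ge a(j)$; neither choice comes from the paper, and `steps_in_window` is an illustrative helper.

```python
def a(n: int) -> float:
    """Illustrative synchronous step size a(n) = 1/n (an assumption)."""
    return 1.0 / n

def steps_in_window(steps, n: int, T: float) -> int:
    """Largest k >= 0 with steps[n+1] + ... + steps[n+k] <= T (steps[0] unused)."""
    total, k = 0.0, 0
    while n + k + 1 < len(steps) and total + steps[n + k + 1] <= T:
        total += steps[n + k + 1]
        k += 1
    return k

N = 20000
sync_steps = [0.0] + [a(j) for j in range(1, N + 1)]
# Hypothetical asynchronous step sizes: the updated component has been
# visited about half the time, so alpha_bar_j = a(ceil(j / 2)) >= a(j).
async_steps = [0.0] + [a((j + 1) // 2) for j in range(1, N + 1)]

T = 1.0
for n in (10, 100, 1000):
    ell = steps_in_window(sync_steps, n, T)       # plays the role of ell_n(T)
    ell_bar = steps_in_window(async_steps, n, T)  # plays the role of bar-ell_n(T)
    print(n, n + ell, n + ell_bar)
```

Because every asynchronous step is at least as large as the synchronous one, fewer of them fit in a window of length $T$, which is exactly the inequality between the two index ranges in the supremum.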

A.3. Weak Convergence of the Relative Step Sizes. Firstly, we will need a result which follows from (A2)(b) and Lemma A.1.

Lemma A.3. Under assumptions (A2)(b) and (A4),

\[
\liminf_{n\to\infty} \left(\frac{a(n)}{\bar\alpha_n}\right) \ge A_\eta^{-1}. \tag{A.4}
\]

Proof.

\begin{align*}
\liminf_{n\to\infty} \left(\frac{a(n)}{\bar\alpha_n}\right)
&= \liminf_{n\to\infty} \left(\frac{a(n)}{a(\nu_n(i))}\right) \\
&\ge \liminf_{n\to\infty} \left(\frac{a(n)}{a([\eta n])}\right) \\
&\ge A_\eta^{-1}, \quad \text{a.s.},
\end{align*}

where the first step must hold for some $i \in I$, the second step follows from Lemma A.1 and the last step is directly from (A2)(b). □
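For a concrete feel for the bound (A.4), the following sketch evaluates the ratio $a(n)/a([\eta n])$ for the illustrative choice $a(n) = 1/n$ and a hypothetical $\eta = 0.25$. The constant `A_eta` is computed empirically over a finite range as a stand-in for the constant in (A2)(b); none of these values are taken from the paper.

```python
import math

def a(n: int) -> float:
    """Illustrative step size a(n) = 1/n (an assumption)."""
    return 1.0 / n

ETA = 0.25                   # hypothetical lower bound on the update frequency
N0 = math.ceil(1.0 / ETA)    # smallest n with floor(ETA * n) >= 1
N = 100_000

# Empirical stand-in for A_eta in (A2)(b): sup over n of a(floor(eta n)) / a(n).
A_eta = max(a(math.floor(ETA * n)) / a(n) for n in range(N0, N + 1))

# Smallest value of a(n) / a(floor(eta n)) over the same range.
min_ratio = min(a(n) / a(math.floor(ETA * n)) for n in range(N0, N + 1))

print(A_eta, min_ratio, 1.0 / A_eta)
```

For this step size the ratio $a(n)/a([\eta n])$ equals $[\eta n]/n$, so it stays bounded away from zero, which is the content of the last step of the proof.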

Proof (of Lemma 3.4).

Since $L^2([0,T])$ is a Hilbert space, its bounded subsets are relatively compact and relatively sequentially compact in the weak topology by the Banach-Alaoglu Theorem [10, Appendix A], which guarantees that the sequence $\{u_i^{n}(\cdot)\}_{n\in\mathbb{N}}$ has a weakly convergent subsequence with a limit point in $L^2([0,T])$. Hence, there exists a limit point $\tilde u_i(\cdot)$ of (3.10). For a fixed $T > 0$, $\tilde u_i(\cdot)$ must satisfy (3.9) for all $h(\cdot) \in L^2([0,T])$. Hence it is enough to show, for an arbitrary fixed $T$ and a single $h(\cdot)$, that any limit point is bounded below. Fix $T > 0$, select $0 < v < T$, and take $h(t) = 1$ for all $t \in [0,T]$. Let $\{k(n)\}_{n\in\mathbb{N}}$ be a subsequence of the natural numbers under which $\{u_i^{k(n)}(\cdot)\}_{n\in\mathbb{N}}$ converges.

\begin{align*}
\int_0^{v} \tilde u_i(s)\, ds
&= \lim_{n\to\infty} \int_0^{v} u_i(\bar\tau_{k(n)} + s)\, ds \\
&= \lim_{n\to\infty} \int_{\bar\tau_{k(n)}}^{\bar\tau_{k(n)}+v} u_i(s)\, ds \\
&\ge \lim_{n\to\infty} \sum_{j=k(n)}^{\bar m(\bar\tau_{k(n)}+v)-1} \bar\alpha_{j+1}\, \mu_{j+1}(i) \\
&\ge \lim_{n\to\infty} \sum_{j=k(n)}^{\bar m(\bar\tau_{k(n)}+v)-1} a(j+1)\, 1_{\{i \in I_{j+1}\}}.
\end{align*}

The following part of the proof uses a slight modification of the result by Ma et al. [20, Theorem 2.2], combined with the stochastic approximation form of the updates $w_n$ from Lemma A.1 given in (A.2). The modification is needed because the transition probabilities for the Markov chain on I depend on $x_n$ instead of $w_n$. This requires only a straightforward modification to the proofs in Sections 4 and 5 of [20]. Using this modification of [20, Theorem 2.2] gives,

\begin{align*}
\lim_{n\to\infty} \sum_{j=k(n)}^{\bar m(\bar\tau_{k(n)}+v)-1} a(j+1)\, 1_{\{i \in I_{j+1}\}}
&= \lim_{n\to\infty} \sum_{X \in X(i)} \sum_{j=k(n)}^{\bar m(\bar\tau_{k(n)}+v)-1} a(j+1)\, 1_{\{I_{j+1} = X\}} \\
&= \lim_{n\to\infty} \sum_{X \in X(i)} \sum_{j=k(n)}^{\bar m(\bar\tau_{k(n)}+v)-1} a(j+1)\, w_j(X).
\end{align*}

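Now we combine the above with (A.1) to give

\begin{align*}
\lim_{n\to\infty} \sum_{X \in X(i)} \sum_{j=k(n)}^{\bar m(\bar\tau_{k(n)}+v)-1} a(j+1)\, w_j(X)
&\ge \lim_{n\to\infty} \sum_{X \in X(i)} \sum_{j=k(n)}^{\bar m(\bar\tau_{k(n)}+v)-1} a(j+1)\, \eta \\
&\ge \lim_{n\to\infty} \sum_{j=k(n)}^{\bar m(\bar\tau_{k(n)}+v)-1} \bar\alpha_{j+1}\, \frac{a(j+1)}{\bar\alpha_{j+1}}\, \eta.
\end{align*}

Now using Lemma A.3,

\[
\lim_{n\to\infty} \sum_{j=k(n)}^{\bar m(\bar\tau_{k(n)}+v)-1} \bar\alpha_{j+1}\, \frac{a(j+1)}{\bar\alpha_{j+1}}\, \eta
\;\ge\; \lim_{n\to\infty} \sum_{j=k(n)}^{\bar m(\bar\tau_{k(n)}+v)-1} \bar\alpha_{j+1}\, A_\eta^{-1} \eta, \quad \text{a.s.}
\]

We convert the sum back to an integral to give,

\begin{align*}
\lim_{n\to\infty} \sum_{j=k(n)}^{\bar m(\bar\tau_{k(n)}+v)-1} \bar\alpha_{j+1}\, A_\eta^{-1} \eta
&\ge \lim_{n\to\infty} \int_0^{v} A_\eta^{-1} \eta\, ds \;-\; \lim_{n\to\infty} \bar\alpha_{\bar m(\bar\tau_{k(n)}+v)+1}\, A_\eta^{-1} \eta \\
&= \int_0^{v} A_\eta^{-1} \eta\, ds.
\end{align*}

Taking $\varepsilon = A_\eta^{-1} \eta$ completes the proof.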

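We shall now prove the important corollary to Lemma 3.4.

Proof (of Corollary 3.5).
Begin by noting that

\[
\int_t^{t+v} \tilde u_i(s)\, ds
= \lim_{n\to\infty} \int_t^{t+v} u_i^{k(n)}(s)\, ds
= \lim_{n\to\infty} \int_{t+\bar\tau_{k(n)}}^{t+\bar\tau_{k(n)}+v} u_i(s)\, ds.
\]

Now let $\tilde n = \bar m(t + \bar\tau_{k(n)})$ and note that $u_i(t) \in [0,1]$.

\begin{align*}
\lim_{n\to\infty} \int_{t+\bar\tau_{k(n)}}^{t+\bar\tau_{k(n)}+v} u_i(s)\, ds
&\ge \lim_{n\to\infty} \left( \int_{\bar\tau_{\tilde n}}^{\bar\tau_{\tilde n}+v} u_i(s)\, ds - \bar\alpha_{\tilde n+1} \right) \\
&= \lim_{n\to\infty} \int_0^{v} u_i^{\tilde n}(s)\, ds.
\end{align*}

Now, $U$ is a compact metric space [10, Appendix A] and hence is weakly countably (limit point) compact. This means the infinite subset $\{u_i^{\tilde n}(\cdot)\}_{n\in\mathbb{N}}$ has a limit in $U$, and this limit point must satisfy Lemma 3.4. This gives,

\[
\int_t^{t+v} \tilde u_i(s)\, ds \;\ge\; \lim_{n\to\infty} \int_0^{v} u_i^{\tilde n}(s)\, ds \;\ge\; v\varepsilon, \quad \text{a.s.},
\]

and since this statement is true for all $t, v > 0$, the statement follows. □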

A.4. Proof of Lemma 3.6. Extend our notation to continuous time by defining $M(t)$ as the $K \times K$ diagonal matrix of the $(u_1(t), \dots, u_K(t))$ terms and recall that $\tilde M(t)$ is the $K \times K$ diagonal matrix of the $(\tilde u_1(t), \dots, \tilde u_K(t))$ terms.

Let $h(\cdot)$ be a bounded continuous function on $[\bar\tau_n, \bar\tau_k]$ such that $h(\bar\tau_j) = f_j$ for all $j = n, \dots, k$. This will mean that $h(\cdot) \in L^2([0,T])$. Throughout this paper we consider the continuous interval $[0,\infty)$ divided into segments of length $\bar\alpha_n$, and hence we can approximate the sum from Lemma 3.6 as an integral,

\[
\sum_{i=n}^{k-1} \bar\alpha_{i+1}\, f_i\, (M_{i+1} - \tilde M_{i+1}).
\]

Now we note that $u_i(t)$ and $\tilde u_i(t)$ are extensions of $\{\mu_n(i)\}_{n\in\mathbb{N}}$ to continuous time which are constant on intervals $[\bar\tau_n, \bar\tau_{n+1})$, and that $M(t)$ and $\tilde M(t)$ are just matrices containing $u_i(t)$ and $\tilde u_i(t)$ respectively. This will mean that $M(t) = M_{\bar m(t)+1}$ and $\tilde M(t) = \tilde M_{\bar m(t)+1}$ and hence we have,

\begin{align}
\int_{\bar\tau_n}^{\bar\tau_k} h(\bar\tau_{\bar m(t)})\, \big(M_{\bar m(t)+1} - \tilde M_{\bar m(t)+1}\big)\, dt
= \int_{\bar\tau_n}^{\bar\tau_k} h(\bar\tau_{\bar m(t)})\, \big(M(t) - \tilde M(t)\big)\, dt. \tag{A.5}
\end{align}



In order to prove the convergence to zero of this term we look to use (3.11), 
but a further expansion is required so that (A.5) is in the correct form. 



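\begin{align}
\Big\| \int_{\bar\tau_n}^{\bar\tau_k} h(\bar\tau_{\bar m(t)})\, \big(M(t) - \tilde M(t)\big)\, dt \Big\|
&\le \Big\| \int_{\bar\tau_n}^{\bar\tau_k} h(t)\, \big(M(t) - \tilde M(t)\big)\, dt \Big\| \tag{A.6} \\
&\quad + \Big\| \int_{\bar\tau_n}^{\bar\tau_k} M(t)\, \big[h(\bar\tau_{\bar m(t)}) - h(t)\big]\, dt \Big\| \tag{A.7} \\
&\quad + \Big\| \int_{\bar\tau_n}^{\bar\tau_k} \tilde M(t)\, \big[h(t) - h(\bar\tau_{\bar m(t)})\big]\, dt \Big\|. \tag{A.8}
\end{align}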



Now, to prove the claim of Lemma 3.6 we will show that each of these terms 
will converge to zero. Firstly, (A.6) converges to zero almost surely by (3.11). 

Both (A.7) and (A.8) are dealt with using the same technique, which we 
demonstrate for (A.8). 



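\begin{align*}
\Big\| \int_{\bar\tau_n}^{\bar\tau_k} \tilde M(t)\, \big[h(t) - h(\bar\tau_{\bar m(t)})\big]\, dt \Big\|
&\le \Big\| \int_{\bar\tau_n}^{\bar\tau_k} \tilde M(t)\, h(t)\, dt - \sum_{i=n}^{k-1} \bar\alpha_{i+1}\, h(\bar\tau_i)\, \tilde M(\bar\tau_i) \Big\| \\
&\quad + \Big\| \sum_{i=n}^{k-1} \bar\alpha_{i+1}\, h(\bar\tau_i)\, \tilde M(\bar\tau_i) - \int_{\bar\tau_n}^{\bar\tau_k} \tilde M(t)\, h(\bar\tau_{\bar m(t)})\, dt \Big\|.
\end{align*}

In both of these terms the sum is a Riemann approximation to the integral using the left-hand points of the partition $[\bar\tau_n, \bar\tau_{n+1}]$ for all $n$. As $n \to \infty$ the width of the partition tends to zero and hence the above terms tend to zero. This proves that (A.7) and (A.8) will converge to zero. □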
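The Riemann-sum argument can be illustrated numerically. The sketch below is a toy check under stated assumptions: the step sizes are $\bar\alpha_j = 1/j$, the integrand is $h(t) = \sin(t)$ with no matrix factor, and the window has length at most $T$. The function `left_riemann_error` and its returned bound are introduced here for illustration and are not part of the paper.

```python
import math

def alpha_bar(j: int) -> float:
    """Illustrative step size 1/j (an assumption)."""
    return 1.0 / j

def left_riemann_error(n: int, T: float):
    """Compare a left Riemann sum of sin on the partition {tau_bar_i} with the
    exact integral over [tau_bar_n, tau_bar_k], where tau_bar_k - tau_bar_n <= T.

    Returns (error, bound), where bound = 0.5 * sum of squared widths, which
    dominates the error because sin is 1-Lipschitz.
    """
    tau = sum(alpha_bar(j) for j in range(1, n + 1))  # tau_bar_n
    start = tau
    riemann, sum_sq, i = 0.0, 0.0, n
    while tau + alpha_bar(i + 1) - start <= T:
        w = alpha_bar(i + 1)           # width of [tau_bar_i, tau_bar_{i+1})
        riemann += w * math.sin(tau)   # integrand at the left end point
        sum_sq += w * w
        tau += w
        i += 1
    exact = math.cos(start) - math.cos(tau)  # integral of sin over [start, tau]
    return abs(exact - riemann), 0.5 * sum_sq

for n in (10, 100, 1000):
    print(n, left_riemann_error(n, 1.0))
```

The bound is at most half the window length times the largest partition width, so it shrinks as $n$ grows, mirroring the statement that the width of the partition tends to zero.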
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